Abstract In this analytical study we derive the optimal unbiased value estimator
(MVU) and compare its statistical risk to three well known value estimators: Temporal 
Difference learning (TD), Monte Carlo estimation (MC) and Least-Squares Temporal 
Difference Learning (LSTD). We demonstrate that LSTD is equivalent to the MVU 
if the Markov Reward Process (MRP) is acyclic and show that both differ for most 
cyclic MRPs as LSTD is then typically biased. More generally, we show that estimators 
that fulfill the Bellman equation can only be unbiased for special cyclic MRPs. The
main reason is the probability measures with respect to which the expectations are taken.
These measures vary from state to state and, due to the strong coupling by the Bellman
equation, it is typically not possible for a set of value estimators to be unbiased with
respect to each of these measures. Furthermore, we derive relations of the MVU to
MC and TD. The most important one is the equivalence of MC to the MVU and
to LSTD for undiscounted MRPs in which MC has the same amount of information.
In the discounted case this equivalence does not hold anymore. For TD we show that 
it is essentially unbiased for acyclic MRPs and biased for cyclic MRPs. We also order 
estimators according to their risk and present counter-examples to show that no general 
ordering exists between the MVU and LSTD, between MC and LSTD and between TD 
and MC. Theoretical results are supported by examples and an empirical evaluation. 

Keywords Optimal Unbiased Value Estimator • Maximum Likelihood Value 
Estimator • Sufficient Statistics • Rao-Blackwell Theorem 



1 Introduction 

One of the important theoretical issues in reinforcement learning is the derivation of rigorous
statements on convergence properties of so-called value estimators (e.g. (Sutton, 1988),
(Watkins and Dayan, 1992), (Jaakkola et al., 1994), (Bradtke and Barto, 1996)), which
provide an empirical estimate of the expected future reward for every given state. So
far most of these convergence results were restricted to the asymptotic case and did not 
provide statements for the case of a finite number of observations. In practice, however, 
one wants to choose the estimator which yields the best result for a given number of 
examples or in the shortest time. 

Current approaches to the finite example case are mostly empirical and few non- 
empirical approaches exist. (Kearns and Singh, 2000) present upper bounds on the 
generalization error for Temporal Difference estimators (TD). They use these bounds to 
formally verify the intuition that TD methods are subject to a "bias-variance" trade-off
and to derive "schedules" for estimator parameters. Comparisons of different estimators 
with respect to the bounds were not performed. The issue of bias and variance in 
reinforcement learning is also addressed in other works ((Singh and Dayan, 1998), 
(Mannor et al., 2007)). (Singh and Dayan, 1998) provide analytical expressions of the 
mean squared error (MSE) for various Monte Carlo (MC) and TD value estimators. 
Furthermore, they provide software that yields the exact mean squared error curves
given a complete description of a Markov Reward Process (MRP). The method can be
used to compare different estimators for concrete MRPs, but it cannot be used to prove
general statements. The most relevant works for our analysis are
provided by (Mannor et al., 2007) and by (Singh and Sutton, 1996). 

In (Mannor et al., 2007) the bias and the variance of value function estimates are
studied and closed-form approximations are provided for these terms. The approximations
are used in a large sample approach to derive asymptotic confidence intervals.
The underlying assumption of normally distributed estimates is tested empirically on
a dataset of a mail-order catalogue. In particular, a Kolmogorov-Smirnov test was unable
to reject the hypothesis of a normal distribution at a significance level of 0.05. The value
function estimates are based on sample mean estimates of the MRP parameters. The
parameter estimates are used in combination with the value equation to produce the
value estimate. Different assumptions are made in the paper to simplify the analysis.
A particularly important assumption is that the number of visits to a state is fixed.
Under this assumption the sample mean parameter estimates are unbiased and the
application of the value equation results in biased estimates. We show that without this
assumption the sample mean estimates underestimate the parameters on average
and the value estimates can therefore be unbiased in special cases. We address this
point in detail in Section 3.4.

In (Singh and Sutton, 1996) different kinds of eligibility traces are introduced and 
analyzed. It is shown that TD(1) is unbiased if the replace-trace is used and that it 
is biased if the usual eligibility trace is used. What is particularly important for our 
work is one of their side findings: The Maximum Likelihood and the MC estimates are 
equivalent in a special case. We characterize this special case with Criterion 3 (p. 14) 
and we make frequent use of this property. We call the criterion the Full Information 
Criterion because all paths that are relevant for a value estimator in a state s must 
hit this state (For details see p. 14). 

In this paper we follow a new approach to the finite example case using tools from 
statistical estimation theory (e.g. (Stuart and Ord, 1991)). Rather than relying on 
bounds, on approximations, or on results to be recalculated for every specific MRP 
this approach allows us to derive general statements. Our main results are sketched 
in Figure 1. The major contribution is the derivation of the optimal unbiased value 
estimator (Minimum Variance Unbiased estimator (MVU), Sec. 3.3). We show that the
Least-Squares Temporal Difference estimator (LSTD) from (Bradtke and Barto, 1996)
is equivalent to the Maximum Likelihood value estimator (ML) (Sec. 3.4.6) and that
both are equivalent to the MVU if the discount γ = 1 (undiscounted) and the Full
Information Criterion is fulfilled or if an acyclic MRP is given (Sec. 3.4.3). In general
the ML estimator differs from the MVU because ML fulfills the Bellman equation and
because estimators that fulfill the Bellman equation can in general not be unbiased
(in the following we refer to estimators that fulfill the Bellman equation as Bellman
estimators). The main reason for this effect is the probability measures with respect to which
the expectations are taken (Sec. 3.1). The bias of the Bellman estimators vanishes
exponentially in the number of observed paths. As both estimators differ in general, it
is natural to ask which of them is better. We show that in general neither the ML nor
the MVU estimator is superior to the other, i.e. examples exist where the MVU is
superior and examples exist where ML is superior (Appendix D.2).

The first-visit MC estimator is unbiased (Singh and Sutton, 1996) and therefore
cannot be better than the MVU. However, we show that for γ = 1 the estimator becomes equivalent
to the MVU if the Full Information Criterion applies (Sec. 3.5). Furthermore, we show
that this equivalence is restricted to the undiscounted case.

Finally, we compare the estimators to TD(λ). We show that TD(λ) is essentially
unbiased for acyclic MRPs (Appendix B) and is thus inferior to the MVU and to the
ML estimator for this case. In the cyclic case TD is biased (Sec. 3.6).

An early version of this work was presented in (Grünewälder et al., 2007). The
analysis was restricted to acyclic MRPs and to the MC and LSTD estimator. The two
main findings were that LSTD is unbiased and optimal for acyclic MRPs and that MC
equals LSTD in the acyclic case if the Full Information Criterion applies and γ = 1.
It turned out that the second finding was already shown in more generality by (Singh 
and Sutton, 1996) [Theorem 5]. The restriction to acyclic MRPs simplified the analysis 
considerably compared to the general case which we approach in this work.

[Figure 1: diagram of the UNBIASED and BELLMAN estimator classes; the overlap is labeled "Acyclic or FI + γ = 1".]

Fig. 1 The figure shows two value estimator classes and four value estimators. On the left the
class of unbiased value estimators is shown and on the right the class of Bellman estimators.
The graph visualises to which classes the estimators belong and how the two classes are related.
The italic labels state the conditions under which different estimators are equivalent or
under which the two classes overlap. FI denotes the Full Information Criterion.

Theoretical findings are summarized in two tables in section 3.7 (p. 21). Symbols are 
explained at their first occurrence and a table of notations is included in Appendix A. 
For the sake of readability proofs are presented in Appendix C.

2 Estimation in Reinforcement Learning

A common approach to the optimization of a control policy is to iterate between 
estimating the current performance (value estimation) and updating the policy based 
on this estimate (policy improvement). Such an approach to optimization is called 
policy iteration (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996). The value 
estimation part is of central importance as it determines the direction of the policy 
improvement step. 

In this work we focus on this value estimation problem and we study it for Markov
Reward Processes. In Reinforcement Learning, Markov Decision Processes are typically
used. An MRP is the same as a Markov Decision Process, with the only difference being
that the policy does not change over time.



2.1 Markov Reward Processes 

A Markov Reward Process consists of a state space S (in our case a finite state space),
probabilities $p_i$ to start in state i, transition probabilities $p_{ij}$ and a random reward $R_{ij}$
between states i and j. The MRP is acyclic if no state i and no path $\pi = (s_1, s_2, s_3, \ldots)$
exist such that $P(\pi) := p_{s_1 s_2} p_{s_2 s_3} \cdots > 0$ and state i is included at least twice in $\pi$.

Our goal is to estimate the values $V_i$ of the states in S, i.e. the expected future
reward received after visiting state i. The value is defined as

$$V_i = \sum_{j \in S} p_{ij} \left( E[R_{ij}] + \gamma V_j \right) \quad \text{and in vector notation by} \quad V = \sum_{t=0}^{\infty} \gamma^t P^t r = (I - \gamma P)^{-1} r,$$

where $P = (p_{ij})$ is the transition matrix of the Markov process, I the identity matrix,
$\gamma \in (0, 1]$ a discount factor and r is the vector of the expected one step reward ($r_i =
\sum_{j \in S} p_{ij} E[R_{ij}]$). In the undiscounted case ($\gamma = 1$) we assume that with probability
one a path reaches a terminal state after a finite number of steps.

A large part of this work is concerned with the relation between the maximum 
likelihood value estimator and the optimal unbiased value estimator. In particular, we 
are interested in equivalence statements for these two estimators. Equivalence between 
these estimators can only hold if the estimates for the reward are equivalent, meaning 
that the maximum likelihood estimator for the reward distribution matches with the 
optimal unbiased estimator. We therefore restrict our analysis to reward distributions 
with this property, i.e. we assume throughout that the following assumption holds: 

Assumption 1 The maximum likelihood estimate of the mean reward is unbiased and 
equivalent to the optimal unbiased estimate. 

The assumption is certainly fulfilled for deterministic rewards. Other important cases
are normally distributed, binomial and multinomial distributed rewards.

2.2 Value Estimators and Statistical Risk 

We compare value estimators with respect to their risk (not the empirical risk) 

$$E[L(\hat V_i, V_i)],$$

where $\hat V_i$ is a value estimator of state i and L is a loss function, which penalizes the
deviation from the true value $V_i$. We will mainly use the mean squared error

$$MSE[\hat V_i] := E[(\hat V_i - V_i)^2], \qquad (1)$$

which can be split into a bias and a variance term

$$MSE[\hat V_i] = \underbrace{V[\hat V_i]}_{\text{Variance}} + \underbrace{\left( E[\hat V_i - V_i] \right)^2}_{\text{Bias}}.$$

An estimator is called unbiased if the bias term is zero. The unbiasedness of an es- 
timator depends on the underlying probability distribution with which the mean is 
calculated. 

Typically, there is a chance that a state is not visited at all by an agent and it makes
no sense to estimate the value if this event occurs. We denote the event that
state i has not been visited by $\{N_i = 0\}$ and the event that it has been visited at least once
by $\{N_i \ge 1\}$, where $N_i$ denotes the number of visits of state i. Unbiased estimators
are estimators that are correct in the mean. However, if we take the (unconditional)
mean for an MRP then we include the term $E[\hat V_i \mid \{N_i = 0\}]$ in the calculation, i.e. the
value estimate for the case that the estimator has not seen a single example. This is
certainly not what we want. We therefore measure the bias of an estimator using the
conditional expectation $E[\,\cdot \mid \{N_i \ge 1\}]$.



Equal Weighting of Examples We conclude this section by citing a simple criterion
with which it is possible to verify unbiasedness and minimal MSE in special cases. This
criterion provides an intuitive interpretation of a weakness of the TD(λ) estimator (see
Section 3.6). Let $x_i$, $i = 1, \ldots, n$, be a sample consisting of $n \ge 1$ independent and
identically distributed (iid) elements of an arbitrary distribution. The estimator

$$\sum_{i=1}^{n} \alpha_i x_i, \quad \text{with } 0 \le \alpha_i \le 1 \text{ and } \sum_{i=1}^{n} \alpha_i = 1, \qquad (2)$$

is unbiased and has the lowest variance for $\alpha_i = 1/n$ (Stuart and Ord, 1991). The $x_i$
could, for example, be the summed rewards for n different paths starting in the same
state s, i.e. $x_i := \sum_{t=0}^{\infty} \gamma^t R_t^{(i)}$, where $R_t^{(i)}$ denotes the reward at time t in path i. The
criterion states that for estimators which are linear combinations of iid examples all 
examples should have an equal influence and none should be preferred over another. 
However, it is important to notice that not all unbiased estimators must be linear 
combinations of such sequences and that better unbiased estimators might exist. In 
fact this is the case for MRPs. The structure of a MRP allows better value estimates. 
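
The equal weighting criterion can be checked with a few lines of code. The following is a minimal sketch, assuming iid examples with a common variance sigma2, so that the variance of the weighted sum is sigma2 times the sum of the squared weights. The function name and the example weights are illustrative assumptions and do not appear in the original analysis.

```python
def weighted_mean_variance(alphas, sigma2=1.0):
    """Variance of sum_i alpha_i * x_i for iid x_i with variance sigma2.

    The weights must sum to one, otherwise the estimator is not unbiased.
    """
    if abs(sum(alphas) - 1.0) > 1e-12:
        raise ValueError("weights must sum to one")
    # For independent examples the variances add up, scaled by alpha_i^2.
    return sigma2 * sum(a * a for a in alphas)


n = 4
equal = [1.0 / n] * n            # alpha_i = 1/n
skewed = [0.7, 0.1, 0.1, 0.1]    # one example is preferred over the others

var_equal = weighted_mean_variance(equal)
var_skewed = weighted_mean_variance(skewed)
```

In this sketch the equal weights give the smaller variance, in line with the criterion stated above.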



2.3 Temporal Difference Learning 

A commonly used value estimator for MRPs is the TD(λ) estimator (Sutton, 1988). It
converges on average ($L^1$-convergence, (Sutton, 1988)) and it converges almost surely
to the correct value (Watkins and Dayan, 1992; Jaakkola et al., 1994). In practical tasks
it seems to outperform the MC estimator with respect to convergence speed and its
computational costs are low. Analyses for the TD(0) estimator are often less technical.
We therefore restrict some statements to this estimator. TD(0) can be defined by means
of an update equation:

$$V_s^{(i+1)} = V_s^{(i)} + \alpha_{i+1} \left( R_{ss'}^{(i+1)} + \gamma V_{s'}^{(i)} - V_s^{(i)} \right), \qquad (3)$$

where $\alpha_{i+1}$ is the learning rate, $V_s^{(i)}$ is the estimated value for state s after the ith
transition, s' is the successor state of s and $R_{ss'}^{(i+1)}$ is the reward which occurred during
the transition from s to s'. The general TD(λ) update equation is given by

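$$V_s^{(i+1)} = V_s^{(i)} + \Delta V_s^{(i+1)} \quad \text{and} \quad \Delta V_s^{(i+1)} = \alpha_{i+1} \left( R_{tt'}^{(i+1)} + \gamma V_{t'}^{(i)} - V_t^{(i)} \right) e_s^{(i+1)},$$

where $\alpha_i$ is the learning rate in sample path i (the learning rate might be defined
differently), t → t' denotes the observed transition and $e_s^{(i+1)}$ is an eligibility trace. The
update equation can be applied after
each transition (online), when a terminal state is reached (offline) or after an entire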
set of paths has been observed (batch update). The eligibility trace can be defined in 
various ways. Two important definitions are the accumulating trace and the replacing 
trace (Singh and Sutton, 1996). In (Singh and Sutton, 1996) it is shown that for λ = 1
the TD(λ) estimator corresponding to the accumulating trace is biased while the one
corresponding to the replacing trace is unbiased. The replacing trace is defined by

$$e_s^{(i+1)} = \begin{cases} 1 & \text{if } s \text{ is the current state,} \\ \gamma \lambda\, e_s^{(i)} & \text{else.} \end{cases} \qquad (4)$$



For acyclic MRPs both definitions are equivalent. For λ < 1 the estimators are biased
towards their initialization value. However, a minor modification is sufficient to remove
the bias for acyclic MRPs (App. B on p. 30). We will mostly use this modified version.
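
The TD(0) update can be illustrated by a short sketch. The code below assumes that the value estimates are stored in a dictionary keyed by state and that one observed transition (s, s', reward) is processed at a time; the function name `td0_update` and the example numbers are hypothetical and only serve the illustration. It implements the plain update, not the modified version from Appendix B.

```python
def td0_update(values, s, s_next, reward, alpha, gamma):
    """Apply one TD(0) update for the observed transition s -> s_next.

    Returns a new dictionary; only the estimate of state s changes.
    """
    td_error = reward + gamma * values[s_next] - values[s]
    new_values = dict(values)
    new_values[s] = values[s] + alpha * td_error
    return new_values


# Hypothetical initial estimates for three states.
values = {"a": 0.0, "b": 1.0, "end": 0.0}
values = td0_update(values, "a", "b", reward=2.0, alpha=0.5, gamma=0.9)
```

With a learning rate below one the new estimate keeps a share of the old estimate, which is how the initialization value can enter the result.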



2.4 Monte Carlo Estimation 

The Monte Carlo estimator is the sample mean estimator of the summed future reward 
(Sutton and Barto, 1998). For acyclic MRPs the MC estimator is given by 




where n is the number of paths that have been observed. 

In the cyclic case there are two alternative MC estimators: First-visit MC and 
every-visit MC. First-visit MC makes exactly one update for each visited state. It uses 
the part of the path which follows upon the first visit of the relevant state. The first-visit
MC estimator $\hat V_i$ is unbiased for every state i, i.e. $E[\hat V_i \mid N_i \ge 1] = V_i$. Every-visit
MC makes an update for each visit of the state. The advantage of the every-visit 
MC estimator is that it has more samples available for estimation, however, the paths 
overlap and the estimator is therefore biased (Singh and Sutton, 1996). Both estimators 
converge almost surely and on average to the correct value. 
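
A minimal sketch of the first-visit MC estimator follows. It assumes that each path is given as a list of (state, reward) pairs, where the reward is the one received on the transition that leaves the state; this data layout and the function name `first_visit_mc` are assumptions made for the example only.

```python
from collections import defaultdict


def first_visit_mc(paths, gamma=1.0):
    """First-visit Monte Carlo value estimates.

    paths: list of paths, each a list of (state, reward) pairs, where
    reward is received on the transition leaving that state.
    """
    returns = defaultdict(list)
    for path in paths:
        # Discounted return from every time step, computed backwards.
        g = 0.0
        tail_returns = [0.0] * len(path)
        for t in range(len(path) - 1, -1, -1):
            g = path[t][1] + gamma * g
            tail_returns[t] = g
        # Use only the return that follows the first visit of each state.
        seen = set()
        for t, (state, _) in enumerate(path):
            if state not in seen:
                seen.add(state)
                returns[state].append(tail_returns[t])
    return {s: sum(r) / len(r) for s, r in returns.items()}


paths = [
    [("s1", 1.0), ("s1", 1.0), ("s2", 0.0)],  # cycle s1 -> s1 taken once
    [("s1", 0.0), ("s2", 0.0)],               # cycle not taken
]
estimates = first_visit_mc(paths, gamma=1.0)
```

An every-visit variant would drop the `seen` check and append a return for each visit, which yields the overlapping samples mentioned above.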

The MC estimators are special cases of TD(λ). The every-visit MC estimator is
equivalent to TD(λ) for the accumulating trace and the first-visit MC estimator for the
replacing trace if λ = 1 and $\alpha_i = 1/i$.

3 Comparison of Estimators: Theory

The central theme of this paper is the relation between two important classes of value 
estimators and between four concrete value estimators. One can argue that the two 
most important estimator classes are the estimators that fulfill the Bellman equation 
and estimators that are unbiased. The former class is certainly of great importance as 
the Bellman equation is the central equation in Reinforcement Learning. The latter 
class proved its importance in statistical estimation theory, where it is the central class 
of estimators that is studied. We analyse the relation between these two classes. 

On the estimator side we concentrate on popular Reinforcement Learning estimators
(the Monte Carlo and the Temporal Difference estimator) and on estimators that
are optimal in the two classes. These are: (1) The optimal unbiased value estimator,
which we derive in Section 3.3. (2) The Maximum Likelihood (ML) estimator, for which
one can argue (yet not prove!) that it is the best estimator in the class of Bellman
estimators.

Parts of this section are very technical. We therefore conclude this motivation with 
a high level overview of the main results. 

Estimator Classes: Unbiased vs. Bellman Estimators The key finding for these two
estimator classes is that cycles in an MRP essentially separate them. That means if we
have an MRP with cycles then either the estimators do not fulfill the Bellman equation or at
least some of the value estimators are biased. The main factor that is responsible
for this effect is the "normalization" $\{N_i \ge 1\}$. The Bellman equation couples the
estimators, yet the estimators must be "flexible" to be unbiased with respect to different
probability measures, i.e. the conditional probabilities $P[\,\cdot \mid \{N_i \ge 1\}]$.

Furthermore, we show that the discount has an effect on the bias of Bellman
estimators. Estimators that use the Bellman equation are based on parameter estimates
$\hat p_{ij}$. We show that these parameter estimates must be discount dependent. Otherwise,
a "further bias" is introduced.

We show that these factors are the main factors for the separation of the classes: 
(1) If the MRP is acyclic or (2) if the problem with the normalization and the discount 
is not present then Bellman estimators can be unbiased. 

Estimator Comparison and Ordering: MVU, ML, TD and MC The key contribution in
this part is the derivation of the optimal unbiased value estimator. We derive this
estimator by conditioning the first-visit Monte Carlo estimator on "all the information"
that is available through the observed paths and we show that the resulting estimator
is optimal with respect to any convex loss function. The conditioning has two effects:

(1) The new estimator uses the Markov structure to make use of (nearly) all paths.

(2) It uses "consistent" alternative cycles besides the observed ones. For example, if a
cyclic connection from state 1 → 1 is observed once in the first run and three times in
the second run, then the optimal estimator will use paths with the cyclic connection
being taken 0 to 4 times. Consistent with this finding, we show that if the first-visit
MC estimator observes all paths and the modification of cycles has no effect, then the
first-visit MC estimator is already optimal.

Furthermore, the methods from statistical estimation theory allow us to establish
a strong relation between the MVU and the Maximum Likelihood estimator. The ML
estimator also uses all information, but it is typically biased as it fulfills the Bellman
equation. However, in the cases where the ML estimator is unbiased it is equivalent to
the MVU. In particular, the ML estimator is unbiased and equivalent to the MVU for
acyclic MRPs and for MRPs where the Full Information Criterion applies.

In the final theory part we address the Temporal Difference estimator. In
contrast to MC and ML the theoretical results for TD are not as strong. The reason
is that the tools from statistical estimation theory that we apply can only be
used to compare estimators inside one of the two estimator classes, whereas TD is
typically contained neither in the class of unbiased estimators nor in the class of Bellman
estimators. We therefore fall back on a more direct comparison of TD to
ML. The analysis makes concrete the relation of the optimal value estimator to TD
and demonstrates the power of the Rao-Blackwell theorem.

Besides the mentioned equivalence statements between different estimators we
also establish orderings like "the MVU is at least as good as the first-visit MC
estimator", or we give counter-examples if no ordering exists.

3.1 Unbiased Estimators and the Bellman Equation 

In this section we analyze the relation between unbiased estimators and Bellman
estimators. Intuitively, we mean by "a value estimator $\hat V$ fulfills the Bellman equation"
that $\hat V = \hat r + \gamma \hat P \hat V$, where $\hat r$, $\hat P$ are the rewards, respectively the transition matrix, of
a well defined MRP. We make this precise with the following definition:

Definition 1 (Bellman Equation for Value Estimators.) An estimator $\hat V$ fulfills
the Bellman equation if an MRP $\hat M$ exists with the same state space as the original MRP,
with a transition matrix $\hat P$, deterministic rewards $\hat r$ and with value $\hat V$, i.e. $\hat V = \hat r + \gamma \hat P \hat V$.
Furthermore, $\hat M$ is not allowed to have additional connections, i.e. $\hat p_{ij} = 0$ if in the
original MRP $p_{ij} = 0$ holds.

Two remarks: Firstly, we restrict the MRP $\hat M$ to have deterministic rewards for
simplicity. Secondly, the last condition is used to enforce that the MRP $\hat M$ has a "similar
structure" as the original MRP. However, it is possible for $\hat M$ to have fewer connections.
For example, this will be the case if not every transition i → j has been observed.

Constraining the estimator to fulfill the Bellman equation restricts the class of
estimators considerably. Essentially, the only degree of freedom is the parameter estimate
$\hat P$. If $I - \gamma \hat P$ is invertible then

$$\hat V = (I - \gamma \hat P)^{-1} \hat r =: V(\hat P, \hat r),$$

i.e. $\hat V$ is completely specified by $\hat P$ and $\hat r$. Here, $V(\hat P, \hat r)$ denotes the value function for
an MRP with parameters $\hat P$ and rewards $\hat r$. In particular, the Bellman equation couples
the value estimates of different states. This coupling of the value estimates introduces
a bias. The intuitive explanation of the bias is the following: Assume we have two value
estimators $\hat V_i$, $\hat V_j$, both are connected with a connection i → j, and $p_{ij} = 1$ holds.
Fixing $E[\hat V_j \mid \{N_j \ge 1\}] = V_j$ then essentially defines the value for $\hat V_i$, as $\hat V_i = \hat r_i + \gamma \hat V_j$.
Yet, the value for $\hat V_i$ must be flexible to allow $\hat V_i$ to depend on the probability of
$\{N_i \ge 1\}$, as $E[\hat V_i \mid \{N_i \ge 1\}] = V_i$ must hold. It is in general not possible to fulfill
both constraints simultaneously in the cyclic case, i.e. constraining $\hat V_i$ for all states i
and enforcing the Bellman equation. However, value estimators for single states can be
unbiased, even if the Bellman equation is fulfilled.
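
The coupling introduced by the Bellman equation can be made concrete with a small sketch. The code below computes the value function of an MRP with given transition matrix and deterministic rewards by repeatedly applying V ← r + γPV, which converges to the solution of the Bellman equation when γ < 1 or when every path terminates. The function name `bellman_value` and the three-state example are assumptions for illustration, not the paper's procedure.

```python
def bellman_value(P, r, gamma, sweeps=1000):
    """Solve V = r + gamma * P V by fixed-point iteration.

    P: row-stochastic (or sub-stochastic) matrix as a list of lists,
    r: list of expected one-step rewards, gamma: discount factor.
    """
    n = len(r)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [r[i] + gamma * sum(P[i][j] * V[j] for j in range(n))
             for i in range(n)]
    return V


# Hypothetical acyclic chain 0 -> 1 -> 2, state 2 is terminal.
P_hat = [[0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0]]
r_hat = [1.0, 2.0, 0.0]
V_hat = bellman_value(P_hat, r_hat, gamma=0.5)
```

In the example the estimate of state 0 is fully determined by the estimate of state 1 once the parameters are fixed, which is the coupling discussed above.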

Another factor that influences the bias is the discount γ. If the Bellman equation
is fulfilled by $\hat V$ then the value estimate can be written as $\sum_{t=0}^{\infty} \gamma^t \hat P^t \hat r$, i.e. $\gamma^t$ weights
the estimate $\hat P^t$ of $P^t$. If $E[\hat P^t] \ne P^t$ and the parameter estimate $\hat P$ is independent
of γ then with varying γ the deviations of $\hat P^t$ from $P^t$ are weighted differently and
it is intuitive that we can find a γ for which the weighted deviations do not cancel
out and the estimator is not unbiased. This effect can be circumvented by making the
parameter estimator $\hat P$ discount dependent.



3.1.1 Normalization $P[\{N_i \ge 1\}]$ and Value Estimates on $\{N_i = 0\}$

Consider the MRP shown in Figure 2 (B) and let the number of observed paths be one
(n = 1). The agent starts in state 2 and has a chance of p to move on to state 1. The
value of state 1 and 2 is

$$V_1 = V_2 = (1 - p) \sum_{i=0}^{\infty} i\, p^i = \frac{p}{1 - p}.$$

Using the sample mean parameter estimate $\hat p = i/(i + 1)$, we get the following value
estimate for state 2:

$$\hat V_2(i) = \frac{\hat p}{1 - \hat p} = i \quad \Rightarrow \quad E[\hat V_2(i)] = (1 - p) \sum_{i=0}^{\infty} \hat V_2(i)\, p^i = V_2,$$

where $\hat V_2(i)$ denotes the value estimate, given the cyclic transition has been taken i
times. The estimator fulfills the Bellman equation. Therefore, $\hat V_1(i) = \hat V_2(i) = i$, given
at least one visit of state 1, i.e. conditional on the event $\{N_1 \ge 1\}$. The expected value
estimate for state 1 is therefore

$$E[\hat V_1(i) \mid \{N_1 \ge 1\}] = \frac{(1 - p) \sum_{i=1}^{\infty} i\, p^i}{(1 - p) \sum_{i=1}^{\infty} p^i} = \frac{p/(1 - p)}{p} = \frac{1}{1 - p} \ne V_1,$$

where $(1 - p) \sum_{i=1}^{\infty} p^i = p$ is the normalization. Hence, the estimator is biased.

Intuitively the reasons for the bias are: Firstly, $\hat V_1$ equals $\hat V_2$ on $\{N_1 \ge 1\}$ but
the estimators differ (in general) on $\{N_1 = 0\}$. In the example, we made no use of
this point. We could make use of it by introducing a reward for the transition 2 → 3.
Secondly, the normalization differs, i.e. $E[\,\cdot\,]$ versus $E[\,\cdot \mid \{N_1 \ge 1\}]$. In our example we
used this point. Both estimators are 0 on $\{N_1 = 0\}$ and are therefore always equivalent.
However, the expectation is calculated differently and introduces the bias.
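
The effect of the normalization can be checked numerically. The sketch below assumes, as in the example, that the cyclic transition is taken i times with probability (1 − p)p^i and that the value estimate for both states equals i in that case. The infinite sums are truncated after a fixed number of terms, which is an approximation introduced here; the function name and the choice p = 0.5 are illustrative.

```python
def example_b_expectations(p, terms=200):
    """Truncated expectations of the value estimate i.

    Returns the unconditional mean (state 2) and the mean conditional
    on at least one visit of state 1, i.e. on i >= 1.
    """
    weights = [(1.0 - p) * p ** i for i in range(terms)]
    e_v2 = sum(i * w for i, w in enumerate(weights))
    norm = sum(weights[1:])  # probability of the event {N_1 >= 1}
    e_v1_cond = sum(i * w for i, w in enumerate(weights) if i >= 1) / norm
    return e_v2, e_v1_cond


p = 0.5
true_value = p / (1.0 - p)
e_v2, e_v1_cond = example_b_expectations(p)
```

For p = 0.5 the unconditional mean matches the true value p/(1 − p), while the conditional mean is larger by the factor 1/p.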

The following Lemma shows that this problem does not depend on the parameter 
estimate we used: 

Lemma 1 (p. 32) For the MRP from Figure 2 (B) there exists no parameter estimator
$\hat p$ such that $\hat V_i(\hat p)$ is unbiased for all states i.

How do these effects depend on the number n of observed paths? Let
$P_i$ denote the probability to visit state i in one sampled path. Then the probability
of the event $\{N_i = 0\}$ drops exponentially fast, i.e. $P[\{N_i = 0\}] \le (1 - P_i)^n$, and
the normalization $1/P[\{N_i \ge 1\}]$ approaches one exponentially fast. Therefore, if the
estimates are upper bounded on $\{N_i = 0\}$ then the bias drops exponentially fast in n.

3.1.2 Discount

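Consider the MRP from Figure 2 (A) for one run (n = 1) and for γ < 1. We use again
the sample mean parameter estimate, i.e. $\hat p = i/(i + 1)$ if the cyclic transition has been
taken i times. The value of state 1 is

$$V_1 = (1 - p) \sum_{i=0}^{\infty} \gamma^i p^i = \frac{1 - p}{1 - \gamma p} \quad \text{and the value estimate is} \quad \hat V_1 = \frac{1 - i/(i + 1)}{1 - \gamma\, i/(i + 1)}.$$

The estimator is unbiased if and only if

$$(1 - p) \sum_{i=0}^{\infty} \gamma^i p^i \stackrel{?}{=} E[\hat V_1] = (1 - p) \sum_{i=0}^{\infty} \frac{1 - i/(i + 1)}{1 - \gamma\, i/(i + 1)}\, p^i.$$

The equality marked with $\stackrel{?}{=}$ holds if and only if

$$\gamma^i = \frac{1 - i/(i + 1)}{1 - \gamma\, i/(i + 1)} \quad \text{for all } i.$$

With induction one sees that $\gamma^i < \frac{1 - i/(i + 1)}{1 - \gamma\, i/(i + 1)}$ for all $i \ge 1$. Induction step:

$$\gamma^{i+1} = \gamma\, \gamma^i \stackrel{\text{I.H.}}{<} \gamma\, \frac{1 - i/(i + 1)}{1 - \gamma\, i/(i + 1)} < \frac{1 - (i + 1)/(i + 2)}{1 - \gamma\, (i + 1)/(i + 2)}$$

$$\Leftrightarrow \; (i - \gamma i)(\gamma - 1) < (1 - \gamma)^2 \; \Leftrightarrow \; -(i - \gamma i) < 1 - \gamma,$$

where the last inequality holds, because $-(i - \gamma i) < 0$ and $(1 - \gamma) > 0$. I.H. denotes
Induction Hypothesis. Furthermore, for i = 1

$$0 < (1 - \gamma)^2 = 1 - 2\gamma + \gamma^2 \; \Leftrightarrow \; \gamma < \frac{1}{2 - \gamma} = \frac{1 - 1/2}{1 - \gamma/2}$$

holds. Hence, the estimator is biased for all γ < 1. It is only unbiased if γ = 1.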
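
The discount effect can also be checked numerically. The sketch below assumes the expressions of this example, i.e. a true value of (1 − p)/(1 − γp) and a value estimate of (1 − p̂)/(1 − γp̂) with p̂ = i/(i + 1) when the cyclic transition has been taken i times, which happens with probability (1 − p)p^i. The sum is truncated, and the function name and parameter values are illustrative assumptions.

```python
def example_a_bias(p, gamma, terms=400):
    """Truncated bias E[V_hat] - V of the sample-mean-based estimate."""
    true_value = (1.0 - p) / (1.0 - gamma * p)
    expected_estimate = 0.0
    for i in range(terms):
        p_hat = i / (i + 1.0)
        v_hat = (1.0 - p_hat) / (1.0 - gamma * p_hat)
        expected_estimate += (1.0 - p) * p ** i * v_hat
    return expected_estimate - true_value


bias_discounted = example_a_bias(p=0.5, gamma=0.5)
bias_undiscounted = example_a_bias(p=0.5, gamma=1.0)

# Term-wise inequality from the induction argument, checked for small i.
gamma = 0.5
inequality_holds = all(
    gamma ** i < 1.0 / (i + 1.0 - gamma * i) for i in range(1, 50)
)
```

For the chosen values the bias is strictly positive in the discounted case and vanishes, up to truncation error, for γ = 1.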

In general, value estimators that fulfill the Bellman equation, respectively use the
value function, must at least be discount dependent to be able to be unbiased for
general MRPs, as the following Lemma shows:

Lemma 2 (p. 33) For the MRP from Figure 2 (A) and for n = 1 there exists no
parameter estimator $\hat p$ that is independent of γ such that $\hat V(\hat p)$ is unbiased for all
parameters p and all discounts γ.



3.2 Maximum Likelihood Parameter Estimates and Sufficient Statistics 

We start this section with a derivation of the maximum likelihood parameter estimates.
After that we introduce a minimal sufficient statistic for MRPs and we show that this
statistic equals the maximum likelihood estimates.

Fig. 2 A: A cyclic MRP with starting state 1 and with probability p for the cyclic transition.
The reward is 1 for the cyclic transition and 0 otherwise. B: A cyclic MRP with starting state
2 and with probability p for the cyclic transition. The reward is 1 for the cyclic transition from
state 2 to state 1 and 0 otherwise.



3.2.1 Maximum Likelihood Parameter Estimates 

Let $p_{ij}$ be the transition probability of state i to j, $p_i$ the probability to start in i
and x a sample consisting of n iid state sequences $x_1, \ldots, x_n$. The log-likelihood of the
sample is

$$\log P[x \mid p] = \sum_{k=1}^{n} \log P[x_k \mid p].$$

The corresponding maximization problem is given by

$$\max_{p_{ij},\, p_i} \sum_{k=1}^{n} \log P[x_k \mid p_{ij}, p_i], \quad \text{s.t.} \quad \sum_{j \in S} p_{ij} = \sum_{i \in S} p_i = 1.$$

The unique solution for $p_{ij}$ and $p_i$ (Lagrange multipliers) is given by

$$p_{ij} = \frac{\mu_{ij}}{K_i} =: \hat p_{ij} \quad \text{and} \quad p_i = \frac{1}{n} \Big( K_i - \sum_{j \in S} \mu_{ji} \Big) =: \hat p_i, \qquad (5)$$

where $K_i$ denotes the number of visits of state i, $\mu_{ij}$ the number of direct transitions
from i to j, $\hat p_{ij}$ the estimate of the true transition probability $p_{ij}$ and $\hat p_i$ the estimate
of the true starting probability $p_i$.
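
The counting behind the maximum likelihood estimates can be sketched as follows. The code assumes that each observed path is a list of state labels; the function name `ml_parameter_estimates` and the example paths are hypothetical. Transition estimates are only produced for pairs (i, j) that were actually observed.

```python
from collections import Counter


def ml_parameter_estimates(paths):
    """Visit counts K_i, transition counts mu_ij and the resulting
    estimates of transition and starting probabilities."""
    visits = Counter()        # K_i: number of visits of state i
    transitions = Counter()   # mu_ij: number of direct transitions i -> j
    for path in paths:
        visits.update(path)
        transitions.update(zip(path, path[1:]))
    n = len(paths)
    p_trans = {(i, j): c / visits[i] for (i, j), c in transitions.items()}
    incoming = Counter()      # sum_j mu_ji: arrivals in state i
    for (_, j), c in transitions.items():
        incoming[j] += c
    # Visits that are not arrivals via a transition must be path starts.
    p_start = {i: (visits[i] - incoming[i]) / n for i in visits}
    return p_trans, p_start


paths = [[1, 1, 2], [1, 2], [1, 1, 1, 2]]
p_trans, p_start = ml_parameter_estimates(paths)
```

In the example state 1 is visited six times with three cyclic transitions, so the cyclic transition probability is estimated as one half.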

3.2.2 Sufficient Statistics for the MRP Parameters 

Information about a sample is typically available through a statistic S of the data (for
example $S = \sum_i x_i$, where x is a sample). A statistic which contains all information
about a sample is called sufficient. Important properties of sufficient statistics are
minimality and completeness. The minimal sufficient statistic is the sufficient statistic
with the smallest dimension (typically the same dimension as the parameter space).
Formally, suppose that a statistic S is sufficient for a parameter θ. Then S is minimally
sufficient if S is a function of any other statistic T that is sufficient for θ. Formally,
a statistic S is complete if $E_\theta[h(S)] = 0$ for all θ implies h = 0 almost surely. The
theorem from Rao and Blackwell (Stuart and Ord, 1991) states that for a complete
and minimal sufficient statistic S and any unbiased estimator A of a parameter the
estimator E[A|S] is the optimal unbiased estimator with respect to any convex loss
function and hence the unbiased estimator with minimal MSE.
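
The conditioning step of the Rao-Blackwell theorem can be illustrated with a toy example that is not taken from the MRP setting. The sketch assumes n iid Bernoulli observations, for which the sum S is a standard example of a complete sufficient statistic, and starts from the crude unbiased estimator that returns only the first observation. Given S = s, all samples with that sum are equally likely, so the conditional expectation can be computed by enumeration. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def rao_blackwellize_first_observation(n, s):
    """E[x_1 | sum(x) = s] for n iid Bernoulli observations.

    Conditional on the sum, every binary sample with that sum has the
    same probability, so a plain average over these samples suffices.
    """
    matching = [x for x in product((0, 1), repeat=n) if sum(x) == s]
    return Fraction(sum(x[0] for x in matching), len(matching))


conditioned = {s: rao_blackwellize_first_observation(4, s) for s in range(5)}
```

The conditioned estimator equals s/n, i.e. the sample mean, which uses all observations instead of only the first one.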

The maximum likelihood solution is a sufficient statistics for the MRP parameters. 
We demonstrate this with the help of the Fisher- Ney man factorization theorem (Stuart 
and Ord, 1991). It states that a statistic is sufficient if and only if the density /(x|#) 
can be factored into a product g(S,9)h(x). For a MRP we can factor the density as 
needed by the Fisher-Neyman theorem (/i(x) = 1 in our case), 

n Li 

p(xb)=n(^(i ) n^a-i)x i a))=n^" Es ' Ms ' J n ■ 

i=l j=2 sGS s,s'e§ 

where Xj(j) is the jth state in the ith path, n the number of observed paths and 
Li the length of the ith path. K s /j. ss i is sufficient for p ss i and because sufficiency is 
sustained by one-to-one mappings (Stuart and Ord, 1991) this holds true also for fj, ss i . 
The sufficient statistic is minimal because the maximum likelihood solution is unique
(Stuart and Ord, 1991; see footnote 1). The sufficient statistic is also complete because the sample
distribution induced by an MRP forms an exponential family of distributions (Lemma
4, page 33). A family $\{P_\theta\}$ of distributions is said to form an s-dimensional exponential
family if the distributions $P_\theta$ have densities of the form

$$p_\theta(x) = \exp\Big( \sum_{i=1}^{s} \eta_i(\theta)\, T_i(x) - A(\theta) \Big)\, h(x) \qquad (6)$$

with respect to some common measure $\mu$ (Lehmann and Casella, 1998). Here, the $\eta_i$
and A are real-valued functions of the parameters, the $T_i$ are real-valued statistics and
x is a point in the sample space. The $\eta$'s are called natural parameters. It is important
that the natural parameters are not functionally related. In other words, no function f should
exist with $\eta_2 = f(\eta_1)$. If the natural parameters are not functionally related, then the
distribution is complete (Lehmann and Casella, 1998). Otherwise, the family forms
only a curved exponential family and a curved exponential family is not complete.



3.3 Optimal Unbiased Value Estimator 

The Rao-Blackwell theorem (Stuart and Ord, 1991) states that for any unbiased estimator
A the estimator E[A|S] is the optimal unbiased estimator with probability one
(w.p. 1), given that S is a minimal and complete sufficient statistic. For the case of value
estimation this means that we can take any unbiased value estimator (e.g. the Monte
Carlo estimator) and condition it on the statistic induced by the maximum likelihood
parameter estimate to get the optimal unbiased value estimator.

Theorem 2 Let $\hat V$ be the first-visit Monte-Carlo estimator and S the sufficient and
complete statistic for a given MRP. The estimator $E[\hat V|S]$ is unbiased and the optimal
unbiased estimator with respect to any convex loss function w.p. 1. In particular, it has
minimal MSE w.p. 1.

Footnote 1: To be formally correct, the minimal parameter set of the MRP must be used. The
minimal sufficient statistic then also excludes one value $\mu_{ss'}$; however, the missing value is
determined by the other $\mu$'s.

From now on, we refer to the estimator $E[\hat V|S]$ as the Minimum Variance Unbiased
estimator (MVU). For a deterministic reward the estimator $E[\hat V|S]$ is given by

$$E[\hat V \mid S] = \frac{1}{|\Pi(S)|} \sum_{\pi \in \Pi(S)} \hat V(\pi), \qquad (7)$$

where $\pi := (\pi_1, \ldots, \pi_i)$ denotes a vector of paths, $\Pi(S)$ denotes the set of vectors of
paths which are consistent with the observation S, $|\cdot|$ is the size of a set and $\hat V(\pi)$ is
the MC estimate for the vector of paths $\pi$. Essentially, $\pi$ is an ordered set of paths
and it is an element of $\Pi(S)$ if it produces the observed transitions, starts and rewards.
The MC estimate is simply the average value for the paths in $\pi$. The estimator $E[\hat V|S]$
is thus the average over all vectors of paths which could explain the (compressed) observed data
S. As an example, take the two state MRP from Figure 2 (A). Assume that an agent
starts twice in state 1 and visits state 1 three times in the first run and once in the
second. The vectors of paths which are consistent with this observation are:

$$\Pi(S) = \{((1,1,1,2),(1,2)),\ ((1,1,2),(1,1,2)),\ ((1,2),(1,1,1,2))\}.$$

The MC estimator for the value of a state s does not consider paths which do not hit
s. In contrast, the conditioned estimator uses these paths. To see this, take
a look at the MRP from Figure 8 (A) at p. 36. Assume that two paths were sampled:
(1, 2, 4) and (2, 3). The MC value estimate for state 1 uses only the first path. Taking
a look at

$$\Pi(S) = \{((1,2,4),(2,3)),\ ((1,2,3),(2,4)),\ ((2,3),(1,2,4)),\ ((2,4),(1,2,3))\},$$

we see that the conditioned estimator also uses the information from the second path.
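The averaging over consistent path vectors can be made concrete with a small brute-force sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names (`consistent_path_vectors`, `first_visit_mc`, `mvu`) are hypothetical, the set of terminal states is supplied by the caller, rewards are deterministic per transition, and the rewards $R_{11} = 1$, $R_{12} = 0$ and the discount $\gamma = 1/2$ in the usage lines are chosen purely for illustration. The enumeration is exponential in the amount of data.

```python
from collections import Counter
from fractions import Fraction


def consistent_path_vectors(paths, terminal):
    """Enumerate all ordered vectors of paths with the same starts and transitions."""
    starts = Counter(p[0] for p in paths)
    trans = Counter((a, b) for p in paths for a, b in zip(p, p[1:]))
    n, result = len(paths), []

    def extend(vector, current):
        if current is None:                      # begin a new path
            if len(vector) == n:
                if not +trans:                   # every observed transition was used
                    result.append(tuple(vector))
                return
            for s in list(starts):
                if starts[s] > 0:
                    starts[s] -= 1
                    extend(vector, (s,))
                    starts[s] += 1
        elif current[-1] in terminal:            # the current path is complete
            extend(vector + [current], None)
        else:                                    # follow one of the remaining transitions
            for (a, b) in list(trans):
                if a == current[-1] and trans[(a, b)] > 0:
                    trans[(a, b)] -= 1
                    extend(vector, current + (b,))
                    trans[(a, b)] += 1

    extend([], None)
    return result


def first_visit_mc(vector, state, reward, gamma):
    """Average discounted return after the first visit of `state` (paths missing it are skipped)."""
    returns = []
    for path in vector:
        if state in path:
            k = path.index(state)
            steps = zip(path[k:], path[k + 1:])
            returns.append(sum(gamma ** t * reward[step] for t, step in enumerate(steps)))
    return sum(returns, Fraction(0)) / len(returns)   # assumes at least one path hits `state`


def mvu(paths, terminal, state, reward, gamma):
    """Average of the MC estimate over all consistent path vectors (cf. Equation 7)."""
    vectors = consistent_path_vectors(paths, terminal)
    return sum(first_visit_mc(v, state, reward, gamma) for v in vectors) / len(vectors)


# Two-state example: three consistent vectors; four for the second example.
print(len(consistent_path_vectors([(1, 1, 1, 2), (1, 2)], {2})))
print(len(consistent_path_vectors([(1, 2, 4), (2, 3)], {3, 4})))
# Illustrative rewards and discount (assumptions, not fixed by the text at this point).
print(mvu([(1, 1, 1, 2), (1, 2)], {2}, 1, {(1, 1): 1, (1, 2): 0}, Fraction(1, 2)))
```

Exact rational arithmetic (`Fraction`) is used so that the averages can be compared with hand calculations without rounding effects.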

3.3.1 Costs of Unbiasedness 

The intuition that the MVU uses all paths is, however, not entirely correct. Let us take
a look at the optimal unbiased value estimator of state 1 of the MRP in Figure 2 (B)
for $\gamma = 1$. Furthermore, assume that one run is made and that the path (2, 1, 2, 3)
is observed. No permutations of this path are possible and the estimate of state 1 is
therefore the MC estimate of path (1, 2, 3), which is 0. In general, if we make one run
and we observe i transitions from state 2 to state 1, then the estimate is $(i - 1)$, i.e. we
ignore the first transition. As a consequence, we have on average the following estimate:

$$(1-p) \sum_{i=1}^{\infty} (i-1)\, p^i = p\,(1-p) \sum_{i=0}^{\infty} i\, p^i = p V_1.$$

The term p is exactly the probability of the event $\{N_1 \ge 1\}$ and the estimator is
conditionally unbiased on this event. The intuition is that the estimator needs to
ignore the first transition to achieve (conditional) unbiasedness.

Hence, unbiasedness has its price. Another cost besides this loss of information is
that the Bellman equation cannot be fulfilled. In Section 3.1 we started with Bellman
estimators and we showed that these estimators are biased. Here, we have a concrete
example of an unbiased estimator that does not fulfill the Bellman equation, as $\hat V_1 =
(i - 1) \neq i = \hat V_2$. For this example this is counterintuitive as $p_{12} = 1$ and essentially no
difference between the states exists in the undiscounted case.

3.3.2 Undiscounted MRPs

In the undiscounted case permutations of paths do not change the cumulated reward.
For example, $\sum_{i=1}^{n} R_{\pi(i)\pi(i+1)} = \sum_{i=1}^{n} R_{\pi(\sigma(i))\pi(\sigma(i)+1)}$ if $\sigma$ is a permutation of
(1, . . . , n), because the time at which a reward is observed is irrelevant. This invariance
to permutations already implies a simple fact. We need the following criterion to state
this fact:

Criterion 3 (Full Information) A state s has full information if, for every successor
state s' of s and all paths $\pi$, it holds that

$$\pi(i) = s' \ \Rightarrow\ \exists j \text{ with } j < i \text{ and } \pi(j) = s.$$

$\pi(i)$ denotes the ith state in the path.

Let $\pi$ be a vector of paths following the first visit of state s that are consistent
with the observations. $\hat V(\pi)$ is then given by $(1/|\pi|) \sum_i \sum_j R^{(i)}_{j,j+1}$, where $|\pi|$ is the
number of paths contained in $\pi$ and $R^{(i)}_{j,j+1}$ is the observed reward in path i at position
j. Rearranging the paths does not change the sum and the normalizing term. Therefore
each consistent vector of paths results in the same first-visit MC estimate and the MVU equals
the first-visit MC estimator.

Corollary 1 Let $\hat V$ be the first-visit MC estimator and let the value function be undiscounted.
If the Full Information Criterion applies to a state s, then

$$E[\hat V_s \mid S] = \hat V_s.$$

The undiscounted setting allows alternative representations of the optimal estimator.
As an example, suppose we observed one path $\pi := (1, 1, 1, 2)$ with reward $R(\pi) =
2R_{11} + 1R_{12}$. The optimal estimator is given by $R(\pi)$. Alternatively, we can set the
reward for a path $\pi$ with j cycles to $R(\pi) := jR_{11} + R_{12}$ and average over the set
of paths with 0 to $\infty$ many cycles using a new probability measure $\tilde P[\{j \text{ cycles}\}]$. If
this measure is constrained to satisfy $\sum_{j=0}^{\infty} j\, \tilde P[\{j \text{ cycles}\}] = i$, where i is the
observed number of cycles (here i = 2), then

$$\sum_{j=0}^{\infty} \tilde P[\{j \text{ cycles}\}]\, (jR_{11} + R_{12}) = iR_{11} + R_{12} = \text{MVU}. \qquad (8)$$

We emphasize this point here because the ML value estimator, which we discuss in
the next section, can be interpreted in this way.

3.3.3 Convergence 

Intuitively, the estimator should converge because MC converges in $L^1$ and almost
surely. Furthermore, conditioning reduces norm-induced distances to the true value.
This is already enough to conclude $L^1$ convergence, but almost sure convergence is
not induced by a norm. We therefore refer to an integral convergence theorem which
allows us to conclude a.s. convergence under the assumption that the MC estimate is upper bounded
by a random variable $Y \in L^1$. Details are given in Appendix C.3.

Theorem 4 (p. 34) $E[\hat V|S]$ converges on average to the true value. Furthermore, it
converges almost surely if the MC value estimate is upper bounded by a random variable
$Y \in L^1$.

Such a Y exists, for example, if the reward is upper bounded by $R_{max}$ and if $\gamma < 1$, as
in this case each MC estimate is at most $R_{max} \sum_{i=0}^{\infty} \gamma^i = R_{max}/(1 - \gamma)$.

An MVU algorithm can be constructed using Equation 7. However, the algorithm
needs to iterate through all possible paths and therefore has an exponential computation
time.



3.4 Least-Squares Temporal Difference Learning 

In this section we discuss the relation of the MVU to the LSTD estimator. The LSTD
estimator was introduced by (Bradtke and Barto, 1996) and extensively analyzed in
(Boyan, 1998) and (Boyan, 1999). Empirical studies showed that LSTD often substantially
outperforms TD and MC with respect to convergence speed per sample size. In
this section we support these empirical findings by showing that the LSTD estimator
is equivalent to the MVU for acyclic MRPs and closely related to the MVU for undiscounted
MRPs. We derive our statements not directly for LSTD, but for the maximum
likelihood value estimator (ML), which is equivalent to LSTD (Section 3.4.6). The estimator
is briefly sketched in (Sutton, 1988), where it is also shown that batch TD(0)
is in the limit equivalent to the ML estimator. The estimator is also implicitly used in
the certainty-equivalence approach, where a maximum likelihood estimate of an MDP
is typically used for optimization.

3.4.1 Maximum Likelihood Estimator

The ML value estimator is given by $V(\hat P, \hat r)$, where $\hat P := (\hat p_{ij})$ is the maximum likelihood
estimate of the transition matrix and $\hat r$ is the vector of the maximum likelihood
estimates of the expected one step reward. Hence, the ML value estimator is given by:

$$\hat V = \sum_{i=0}^{\infty} \gamma^i \hat P^i \hat r = (I - \gamma \hat P)^{-1} \hat r, \qquad (9)$$

where the Moore-Penrose pseudoinverse is used if $\hat P$ is singular (e.g. too few samples).
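As a rough illustration of Equation 9, the following sketch estimates the transition frequencies from observed paths and solves the linear system $(I - \gamma \hat P)\hat V = \hat r$ exactly. It rests on several assumptions that are not part of the text: the function name `ml_value_estimate` is hypothetical, rewards are deterministic and given per transition, terminal states have value 0, and the system is assumed to be non-singular (the pseudoinverse fallback mentioned above is not implemented).

```python
from collections import defaultdict
from fractions import Fraction


def ml_value_estimate(paths, reward, gamma):
    """Solve (I - gamma * P_hat) V = r_hat with empirical transition frequencies."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    states = sorted(counts)                  # states with at least one outgoing transition
    index = {s: k for k, s in enumerate(states)}
    n = len(states)
    # Augmented matrix [I - gamma * P_hat | r_hat]
    A = [[Fraction(int(i == j)) for j in range(n)] + [Fraction(0)] for i in range(n)]
    for s, succ in counts.items():
        total = sum(succ.values())
        for t, c in succ.items():
            p = Fraction(c, total)           # ML estimate of p_st
            A[index[s]][n] += p * reward[(s, t)]
            if t in index:                   # terminal states contribute value 0
                A[index[s]][index[t]] -= gamma * p
    # Gauss-Jordan elimination (raises StopIteration if the system is singular)
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [x / pivot_value for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return {s: A[index[s]][n] for s in states}


# Two-state cyclic example with illustrative rewards R_11 = 1, R_12 = 0 and gamma = 1.
print(ml_value_estimate([(1, 1, 1, 2), (1, 2)], {(1, 1): 1, (1, 2): 0}, Fraction(1)))
```

For the two paths in the usage line, two cycles are observed in two runs, so the estimate for state 1 equals k/n = 1.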

3.4.2 Unbiasedness and the MVU

If an estimator is a function of the sufficient statistic (e.g. $\hat V = f(S)$) then the conditional
estimator is equal to the original estimator, $\hat V = E[\hat V|S]$. If the estimator $\hat V$
is also unbiased then it is, due to the Rao-Blackwell theorem, the optimal unbiased estimator
w.p. 1. The defined maximum likelihood estimator is a function of a minimal
and complete sufficient statistic. Therefore, the following relation holds between the
ML estimator and the MVU:

Corollary 2 The ML estimator is equivalent to the MVU w.p. 1, if and only if it is
unbiased.

The following two subsections address two cases where ML is unbiased.

3.4.3 Acyclic MRPs

The ML estimator is unbiased in the acyclic case and therefore equivalent to the MVU.

Theorem 5 (p. 35) The ML estimator is unbiased if the MRP is acyclic.

Corollary 3 The ML estimator is equivalent to the MVU w.p. 1 if the MRP is acyclic.



3.4.4 Undiscounted MRPs 



It is also possible that ML value estimates for specific states are unbiased even if
the MRP is cyclic. One important case in which ML value estimates are unbiased
is characterized by the Full Information Criterion. If it applies to a state i then the
normalization $P[\{N_i \ge 1\}]$ does not depend on the normalizations of the successor
states, and in a way the problem of Section 3.1.1 does not affect state i.

This can be shown by using Theorem 5 from (Singh and Sutton, 1996), which
states that the ML estimator equals the first-visit MC estimator if the Full Information
Criterion holds and $\gamma = 1$. Furthermore, in this case the first-visit MC estimator is
equivalent to the MVU w.p. 1 (Corollary 1). Hence, ML is unbiased and optimal w.p.
1. We state this as a corollary:

Corollary 4 The ML estimator of a state i is unbiased and equivalent to the MVU
w.p. 1 if the Full Information Criterion applies to state i and if $\gamma = 1$.

We analyze this effect using a simple MRP and we give two interpretations.



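Example: Cyclic MRP - Unbiased We start by calculating the bias of ML explicitly
for a simple MRP, thus "verifying" the corollary. The value of state 1 for the
MRP of Figure 2 (A) with modified rewards $R_{11} = 1$, $R_{12} = 0$ and $\gamma = 1$ is $V_1 =
(1 - p) \sum_{i=0}^{\infty} i\, p^i$. The ML estimate for a sample of n paths is

$$\hat V_1 = (1 - \hat p) \sum_{i=0}^{\infty} i\, \hat p^i = \frac{k}{n}, \qquad (10)$$

where k is the number of taken cycles (summed over all observed paths), so that the ML
estimate of p is $\hat p = k/(k+n)$. Therefore

$$E[\hat V_1] = E\Big[\frac{1}{n} \sum_{i=1}^{n} k_i\Big] = E[k_1],$$

where $k_i$ denotes the number of cycles taken in the ith path. Furthermore,

$$E[k_1] = (1 - p) \sum_{k=0}^{\infty} k\, p^k = V_1$$

and the ML estimator is unbiased. Now, Corollary 2 tells us that the ML estimator is
equivalent to the MVU w.p. 1.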

It is also possible to show this equivalence using simple combinatorial arguments.
The MVU and the MC estimate for this MRP are both k/n: Let u be the number of ways in which k
cycles can be split among n paths. For each split the summed reward is k and the MC estimate
is therefore k/n. Hence, the MVU is $\frac{1}{u} \cdot u \cdot \frac{k}{n} = \frac{k}{n}$.

Interpretation I: Non-linearity vs. Underestimated Parameters It is interesting that
ML is unbiased in this example. In general, nonlinear transformations of unbiased parameter
estimates produce biased estimators, as

$$E[f(\hat\theta)] = f(E[\hat\theta]) = f(\theta)$$

essentially means that f is a linear transformation, as f and E commute. Furthermore,
the value function is a nonlinear function. Yet, in our example the parameter estimator
$\hat p$ is actually not unbiased. For n = 1:

$$E[\hat p] = E\Big[\frac{k}{k+1}\Big] = (1-p) \sum_{k=0}^{\infty} \frac{k}{k+1}\, p^k < (1-p) \sum_{k=1}^{\infty} p^k = (1-p) \sum_{k=0}^{\infty} p^{k+1} = p.$$

The parameter is underestimated on average. The reason for this lies in the dependency
between the visits of state 1. For a fixed number of visits, i.e. for iid
observations, the parameter estimate would be unbiased. The relation between these
two estimation settings is very similar to the first-visit and every-visit MC setting. The
first-visit MC estimator is unbiased because it uses only one observation per path while
the every-visit MC estimator is biased. In our case, the effect is particularly paradoxical,
as for the iid case the value estimator is biased.
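The two expectations for n = 1 can be checked numerically. The sketch below is only a sanity check under assumptions: the function names are hypothetical, the infinite series are truncated after 200 terms, and p = 0.5 is an arbitrary illustrative value. For a single path the number of cycles k has probability $p^k(1-p)$, the parameter estimate is k/(k+1) and the value estimate is k.

```python
import math


def expected_p_hat(p, terms=200):
    """E[p_hat] for n = 1, where p_hat = k / (k + 1) and P[k cycles] = p**k * (1 - p)."""
    return sum(k / (k + 1) * p ** k * (1 - p) for k in range(terms))


def expected_value_estimate(p, terms=200):
    """E[V_hat_1] for n = 1, where V_hat_1 = k (rewards R_11 = 1, R_12 = 0, gamma = 1)."""
    return sum(k * p ** k * (1 - p) for k in range(terms))


p = 0.5                                   # illustrative choice
true_value = p / (1 - p)                  # V_1 = (1 - p) * sum_i i * p**i
print(expected_p_hat(p), "<", p)          # the parameter is underestimated on average
print(expected_value_estimate(p), "vs", true_value)
print(math.isclose(expected_p_hat(p), 1 - math.log(2)))   # closed form for p = 0.5
```

The closed form used in the last line follows from $\sum_{k \ge 0} p^k/(k+1) = -\ln(1-p)/p$, which gives $E[\hat p] = 1 - \ln 2$ for p = 0.5.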

Interpretation II: Consistency of the Set of Paths The ML estimator differs in general
from the MVU because it uses paths that are inconsistent with the observation S. For
example, take the MRP from Figure 2 (A) with modified rewards $R_{11} = 1$, $R_{12} = 0$
and the observation (1, 1, 1, 2). The set of paths consistent with this observation is
again {(1, 1, 1, 2)}. The ML estimator, however, uses the following set of paths

{(1, 2), (1, 1, 2), (1, 1, 1, 2), (1, 1, 1, 1, 2), ...},

with a specific weighting $\tilde P[\{j \text{ cycles}\}]$ for a path that contains j cycles. In general,
this representation will result in an estimate that is different from the MVU estimate.
However, if Corollary 4 applies then both representations are equivalent. Under the
assumptions of the corollary, the ML estimator can be represented as a sum over the
number of cycles with each summand being the product of the estimated path probability
and the reward of the path. One can see this easily for the example (one run with i = 2
cycles being taken): The path probability is in this case simply $\tilde P[\{j \text{ cycles}\}] = \hat p^j (1 - \hat p)$
and because $\sum_{j=0}^{\infty} j\, \hat p^j (1 - \hat p) = i = 2$ (Eq. 10 with n = 1) the estimate is equal to
$2R_{11} + R_{12}$, which is exactly the MVU estimate (compare to Eq. 8 on p. 14).

3.4.5 Which Estimator is better? The MVU or ML? 

The MVU is optimal in the class of unbiased estimators. However, this does not mean 
that the ML estimator is worse than the MVU. The ML estimator is also a function 
of the sufficient statistics, it is just not unbiased. To demonstrate this, we present two 
examples based on the MRP from Figure 2 (A) in Appendix D.2 (p. 37). One for which 
the MVU is superior and one where the ML estimator is superior. We summarize this 
in a corollary: 

Corollary 5 MRPs exist in which the MVU has a smaller MSE than the ML estimator 
and MRPs exist in which the ML estimator has a smaller MSE than the MVU. 



3.4.6 The LSTD Estimator

The LSTD algorithm analytically computes the parameters which minimize the empirical
quadratic error for the case of a linear system. (Bradtke and Barto, 1996) showed
that the resulting algorithm converges almost surely to the true value. In (Boyan,
1998) a further characterization of the least-squares solution is given. This turns out
to be useful to establish the relation to the ML value estimator. According to this
characterization, the LSTD estimate $\hat V$ is the unique solution of the Bellman equation,
i.e.

$$\hat V = \hat r + \gamma \hat P \hat V, \qquad (11)$$

where $\hat r$ is the sample mean estimate of the reward and $\hat P$ is the maximum likelihood
estimate of the transition matrix.

Comparing Equation 11 with Equation 9 of the ML estimator, it becomes obvious
that both are equivalent if the sample mean estimate of the reward equals the maximum
likelihood estimate.

Corollary 6 The ML value estimator is equivalent to LSTD if the sample mean and
the maximum likelihood estimator of the expected reward are equivalent.

3.5 Monte Carlo Estimation 

We first summarize Theorem 5 from (Singh and Sutton, 1996) and Cor. 1 from p. 14: 

Corollary 7 The (first-visit) MC estimator of a state i is equivalent to the MVU and 
to the ML estimator w.p. 1 if the Full Information Criterion applies to state i and an 
undiscounted MRP is given. 

Essentially, the corollary tells us that in the undiscounted case it is only the "amount" 
of information that makes the difference between the MC estimator and the MVU, 
respectively the ML estimator. Amount of information refers here to the observed 
paths. If MC observes every path then the estimators are equivalent. 

From a different point of view this tells us that in the undiscounted case the MRP 
structure is only useful for passing information between states, but yields no advantage 
beyond that. 

3.5.1 Discounted MRPs

In the discounted cyclic case the MC estimator differs from the ML and the MVU
estimator. It differs from ML because ML is biased. The MC estimator is equivalent
to the MVU in the undiscounted case because the order in which the reward is presented
is irrelevant. That means the time at which a cycle occurs is irrelevant. In the
discounted case this is not true anymore. Consider again the MRP from Figure 2 (A)
with rewards $R_{11} = 1$, $R_{12} = 0$ and the following two paths $\pi = ((1,1,1,2),(1,2))$.
The MC estimate is $\frac{1}{2}((1 + \gamma) + 0)$. The set of vectors of paths consistent with this
observation is $\Pi(S) = \{((1,1,1,2),(1,2)),\ ((1,1,2),(1,1,2)),\ ((1,2),(1,1,1,2))\}$. Hence,
the MVU uses the path (1, 1, 2) in addition to the observed ones. The MVU estimate is
$\frac{1}{3}((1 + \gamma)/2 + 2/2 + (1 + \gamma)/2) = \frac{1}{3}(2 + \gamma)$. Both terms are equivalent if and only
if $\gamma = 1$. For this example the Full Information Criterion applies.

Similarly, for acyclic MRPs the MC estimator is different from the ML/MVU
estimator if $\gamma < 1$. Consider a 5 state MRP with the following observed paths:
((1, 3, 4), (1, 2, 3, 5)), a reward of +1 for 3 → 4 and -1 for 3 → 5. The ML estimate is
$(\frac{1}{4}\gamma^2 + \frac{1}{4}\gamma)(1 - 1) = 0$, while the MC estimate is $\frac{1}{2}(-\gamma^2 + \gamma)$, which (for $\gamma > 0$) is 0 if and
only if $\gamma = 1$. Again the Full Information Criterion applies.
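The two MC estimates above can be reproduced with a few lines of code. This is a minimal sketch under assumptions: the function name `first_visit_mc` is hypothetical, rewards are deterministic per transition, transitions not listed in the reward table are taken to have reward 0, and $\gamma = 1/2$ is an arbitrary illustrative discount.

```python
from fractions import Fraction


def first_visit_mc(paths, state, reward, gamma):
    """First-visit MC estimate: mean discounted return after the first visit of `state`."""
    returns = []
    for path in paths:
        if state not in path:
            continue                       # MC ignores paths that never hit the state
        k = path.index(state)
        ret = Fraction(0)
        for t, step in enumerate(zip(path[k:], path[k + 1:])):
            ret += gamma ** t * reward.get(step, 0)   # unlisted transitions: reward 0 (assumption)
        returns.append(ret)
    return sum(returns) / len(returns)


gamma = Fraction(1, 2)                     # illustrative discount
cyclic = first_visit_mc([(1, 1, 1, 2), (1, 2)], 1, {(1, 1): 1}, gamma)
acyclic = first_visit_mc([(1, 3, 4), (1, 2, 3, 5)], 1, {(3, 4): 1, (3, 5): -1}, gamma)
print(cyclic, (1 + gamma) / 2)             # MC estimate vs. the formula (1 + gamma) / 2
print(acyclic, (gamma - gamma ** 2) / 2)   # MC estimate vs. the formula (gamma - gamma^2) / 2
```

With $\gamma = 1$ the same function returns 1 for the cyclic example, which coincides with the MVU value $(2 + \gamma)/3$, and 0 for the acyclic example, which coincides with the ML estimate.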

3.5.2 Ordering with Respect to other Value Estimators 

Besides the stated equivalence, the MVU is for every MRP at least as good as the first-visit
MC estimator, because the first-visit MC estimator is unbiased. The relation to
ML is not that clear cut. In general, MRPs exist where the first-visit MC estimator is
superior and MRPs exist where the ML estimator is superior (see Appendix D.2, p.
37 for examples). How about TD($\lambda$)? Again the relation is not clear cut. In the case
that the MRP is acyclic and that Corollary 7 applies, the first-visit MC estimator is at
least as good as TD($\lambda$). In general, however, no ordering exists (see Appendix D.1, p.
35 for examples).

3.6 Temporal Difference Learning 

One would like to establish inequalities between the estimation error of TD and the
error of other estimators like the MVU or the ML estimator. For the acyclic case
TD($\lambda$) is essentially unbiased and the MVU and the ML estimator are superior to TD.
However, for the cyclic case the analysis is not straightforward, as TD($\lambda$) is biased for
$\lambda < 1$ and does not fulfill the Bellman equation. So TD is in a sense neither in the
estimator class of the MVU nor of the ML estimator and conditioning on a sufficient
statistic does not project TD to either of these estimators.

The bias of TD can be verified with the MRP from Figure 2 (A) with rewards
$R_{11} = 1$, $R_{12} = 0$, with a discount of $\gamma = 1$ and with n = 1. If we take the TD(0)
estimator with a learning rate of $\alpha_j = 1/j$ then the value estimate for state 1 is
$\frac{i}{i+1} \sum_{j=1}^{i} 1/j$ if i cyclic transitions have been observed. The estimate should on
average equal i to be unbiased. Yet, for i > 0 it is strictly smaller than i.
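The closed form for the TD(0) estimate can be checked by running the update rule on a single path. The sketch below makes a few assumptions beyond the text: the function names are hypothetical, both value estimates are initialized with 0, the terminal state 2 keeps the value 0, and the learning rate 1/j is indexed by the jth update of state 1.

```python
from fractions import Fraction


def td0_single_run(cycles):
    """TD(0) estimate of state 1 after one path with `cycles` transitions 1 -> 1, then 1 -> 2."""
    v1, v2 = Fraction(0), Fraction(0)          # initial estimates (assumption); state 2 is terminal
    path = [1] * (cycles + 1) + [2]
    for j, (s, s_next) in enumerate(zip(path, path[1:]), start=1):
        reward = 1 if s_next == 1 else 0       # R_11 = 1, R_12 = 0
        target = reward + (v1 if s_next == 1 else v2)   # gamma = 1
        v1 += Fraction(1, j) * (target - v1)   # learning rate alpha_j = 1 / j
    return v1


def closed_form(cycles):
    """i / (i + 1) * sum_{j=1}^{i} 1 / j."""
    harmonic = sum(Fraction(1, j) for j in range(1, cycles + 1))
    return Fraction(cycles, cycles + 1) * harmonic


for i in range(1, 6):
    print(i, td0_single_run(i), closed_form(i))
```

Each cyclic transition adds 1/j to the estimate, and the final transition to the terminal state shrinks it by the factor i/(i+1), which is where the underestimation comes from.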

While our tools are not suited to establish the inferiority of TD, we can still use them to
interpret the weaknesses of TD. In the following we focus on the TD(0) update rule.

3.6.1 Weighting of Examples and Conditioning

In the examples comparing TD($\lambda$) and MC (Section D.1.1, p. 35) one observes that a
weakness of TD(0) is that not all of the examples are weighted equally. In particular,
Equation 2 on page 5 suggests that no observation should be preferred over another.
Intuitively, conditioning suggests so too: For an acyclic MRP TD(0) can be written
as $\hat V_i = \tilde p_{ij}(R_{ij} + \gamma \hat V_j)$, where $\tilde p_{ij}$ differs from the maximum likelihood parameter
estimates $\hat p_{ij}$ due to the weighting. Generally, conditioning on a sufficient statistic
permutes the order of the observations and resolves the weighting problem. Therefore,
one would assume that conditioning on the element $\hat p_{ij}$ of the sufficient statistic
changes $\hat V_i$ to $\hat p_{ij}(R_{ij} + \gamma \hat V_j)$. As conditioning improves the estimate, the new estimator
would be superior to TD(0). However, conditioning on just a single element $\hat p_{ij}$ need
not modify the estimator at all, as the original path might be reconstructed from the
other observations. E.g. if one observes a transition 1 → 2 and 2 → 3, with 2 → 3
being the only path from state 2 to state 3, then it is enough to know that transition
1 → 2 occurred and state 3 was visited.

Despite these technical problems, the superiority of $\hat p_{ij}$ over $\tilde p_{ij}$ and the weighting
problem are reflected in the contraction properties of TD(0). According to (Sutton, 1988),
TD(0) contracts towards the ML solution. Yet, the contraction is slow compared to
the case where each example is weighted equally.

3.6.2 Weighting of Examples and Contraction Factor 

We continue with another look at the familiar ML equation: $\hat V = \hat r + \gamma \hat P \hat V =: T \hat V$.
If the matrix $\hat P$ is of full rank then the ML estimate is the sole fixed point of the
Bellman operator T. The ML estimate can be obtained by solving the equation, i.e.
$\hat V = (I - \gamma \hat P)^{-1} \hat r$. Alternatively, it is possible to perform a fixed point iteration, i.e.
to start with an initial guess $\hat V^{(0)}$ and to iterate the equation, $\hat V^{(n)} = T \hat V^{(n-1)}$.
Convergence to the ML solution is guaranteed by the Banach Fixed Point Theorem,
because T is a contraction. The contraction factor is upper bounded by $\gamma \|\hat P\| \le \gamma$,
where $\|\cdot\|$ denotes in the following the operator norm. The bound can be improved by
using better suited norms (e.g. (Bertsekas and Tsitsiklis, 1996)). Hence, for n updates
the distance to the ML solution is reduced by a factor of at least $\gamma^n$.
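The fixed point iteration and its $\gamma^n$ bound can be illustrated on a toy problem. Everything concrete in this sketch is an assumption made for illustration: the 3-state estimates `P` and `r`, the discount 9/10, the helper names, and the use of the maximum norm (for which a row-substochastic matrix has operator norm at most 1).

```python
from fractions import Fraction as F


def bellman(V, P, r, gamma):
    """One application of the operator T: V -> r + gamma * P V."""
    n = len(V)
    return [r[i] + gamma * sum(P[i][j] * V[j] for j in range(n)) for i in range(n)]


def max_norm(V, W):
    return max(abs(v - w) for v, w in zip(V, W))


# Hypothetical estimates for three non-terminal states (illustration only).
P = [[F(1, 2), F(1, 2), F(0)],
     [F(0), F(0), F(1)],
     [F(0), F(0), F(0)]]              # the third state only leads to a terminal state
r = [F(1), F(0), F(2)]
gamma = F(9, 10)
V_ml = [F(181, 55), F(9, 5), F(2)]    # solves V = r + gamma * P V for these values

V = [F(0)] * 3
errors = [max_norm(V, V_ml)]
for n in range(1, 11):
    V = bellman(V, P, r, gamma)
    errors.append(max_norm(V, V_ml))

for n, e in enumerate(errors):
    print(n, float(e), float(gamma ** n * errors[0]))   # error vs. the bound gamma^n * initial error
```

The printed errors stay below the bound in every iteration, in line with the contraction argument.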

Applying the TD(0) update (Eq. 3) to the complete value estimate $\hat V$ using $\hat P$ and
a learning rate of 1/n results in

$$\hat V^{(n)} = \hat V^{(n-1)} + \frac{1}{n}\Big(\hat r + \gamma \hat P \hat V^{(n-1)} - \hat V^{(n-1)}\Big) = \Big(\frac{n-1}{n} I + \frac{1}{n} T\Big) \hat V^{(n-1)}.$$

In this equation the weighting problem becomes apparent: The contraction T affects
only a part of the estimate. Yet, the operators $S^{(n)} := (\frac{n-1}{n} I + \frac{1}{n} T)$ are still contractions.
For V and W:

$$\|S^{(n)} V - S^{(n)} W\| \le \frac{n-1}{n} \|V - W\| + \frac{1}{n} \|TV - TW\| \le \frac{n-1+\gamma}{n} \|V - W\|.$$

The contraction coefficient is therefore at most $\frac{n-1+\gamma}{n}$. The ML solution (in the following
$\hat V$) is a fixed point of the $S^{(n)}$ and for n iterations the distance is bounded
by

$$\|S^{(n)} \cdots S^{(1)} \hat V^{(0)} - \hat V\| \le \prod_{i=0}^{n-1} \frac{i+\gamma}{i+1}\, \|\hat V^{(0)} - \hat V\|.$$

The smaller $\gamma$, the faster the contraction. Yet, even in the limit the contraction is much
slower than the contraction with the ML fixed point iteration, i.e. for $\gamma = 0$ the distance
decreases at least with 1/n while for the ML fixed point iteration it decreases with $\gamma^n$.
For $\gamma = 0.1$ and two applications of the Bellman operator the contraction is at least
$\gamma^2 = 1/100$ and it needs 100 iterations with the TD(0) equation to reach the same
distance.

TD(0) is applied only to the current state and not to the full value vector. The
same can be done with the ML fixed point iteration, i.e. $\hat V_i = \hat p_{ij}(R_{ij} + \gamma \hat V_j)$. We
analyze the contraction properties of this estimator in the empirical part and we refer
to the estimator as the iterative Maximum Likelihood (iML) estimator. The costs of the
algorithm are slightly higher than the TD(0) costs: $O(|\mathbb{S}|)$ (time) and $O(|\mathbb{S}|^2)$ (space).

The restriction to the current path does not affect the convergence, i.e. the restricted
iteration converges to the ML solution. Intuitively, the convergence is still guaranteed,
as a contraction of $\gamma$ is achieved by visiting each state once and because each state is
visited infinitely often. Using that idea the following theorem can be proved:

Theorem 6 iML is unbiased for acyclic MRPs and converges on average and almost
surely to the true value.

We use this algorithm only for the analysis and we therefore omit the proof.

3.7 Summary of Theory Results

We conclude the theory section with two tables that summarize central properties of 
estimators and established orderings. Footnotes are used to reference the corresponding 
theorems, corollaries or sections. We start with a table that summarizes the properties 
of the different estimators (Table 1). The row Optimal refers to the class of unbiased 
estimators and to convex loss functions. The statement that ML is unbiased if the
Full Information Criterion is fulfilled and $\gamma = 1$ applies state-wise, i.e. for a cyclic
MRP there will exist a state for which the ML estimator is biased. However, if the
Full Information Criterion applies to a state, then the ML estimator for this particular
state is unbiased. Finally, F-visit MC denotes the first-visit Monte-Carlo estimator.



Estimator            MVU                                  ML/LSTD                              TD(λ)             (F-visit) MC
-------------------  -----------------------------------  -----------------------------------  ----------------  ----------------------
Convergence          L^1, a.s. (1)                        L^1, a.s.                            L^1, a.s.         L^1, a.s.
Cost (Time / Space)  exp. (2)                             O(|S|^3) / O(|S|^3)                  O(|S|) / O(|S|)   O(|S|) / O(|S|)
Unbiased             ✓ (3)                                Acyclic (4) or Cr. 3 and γ = 1 (5)   Acyclic (6)       ✓
Bellman              Acyclic (4) or Cr. 3 and γ = 1 (5)   ✓
Optimal              ✓                                    Acyclic (4) or Cr. 3 and γ = 1 (5)                     Cr. 3 and γ = 1 (7)

Table 1 Comments and references: (1) Th. 4, p. 14. (2) Eq. 7, p. 13. (3) Th. 2, p. 12. (4) Cor.
3, p. 16. (5) Cor. 4, p. 16. (6) Slightly modified TD estimator. Th. 8, p. 32. (7) Cor. 7, p. 18.
Counterexamples for γ < 1: Sec. 3.5.1, p. 18. Cr. 3 denotes Criterion 3 (Full Information).



Table 2 summarizes established orderings between value estimators. The legend
is the following: = means equivalent, ≠ means not comparable, ≤ means that the
estimator in the corresponding row has a risk (estimation error) that is at most that of the
estimator in the corresponding column. For ≠, "In general" means that there
exist MRPs where the row estimator is superior and MRPs where the column estimator
is superior. However, for a subclass of MRPs, like acyclic MRPs, one of the estimators
might be superior or they might be equivalent. For ≤, "In general" means that the row
estimator is always at least as good as the column estimator; however, both might be equivalent
on a subclass of MRPs.

          ML/LSTD                 TD(λ)                (F-visit) MC
--------  ----------------------  -------------------  --------------------------
MVU       ML unbiased: = (1)      Acyclic: ≤ (3,4)     Cr. 3 and γ = 1: = (5)
          In general: ≠ (2)                            In general: ≤ (4)
ML/LSTD                           Acyclic: ≤ (3,4,7)   Cr. 3 and γ = 1: = (6)
                                                       In general: ≠ (2)
TD(λ)                                                  In general: ≠ (8)

Table 2 Comments and references: (1) Cor. 2, p. 15. (2) Counterexamples: App. D.2, p. 37.
(3) Slightly modified TD estimator. Th. 8, p. 32. (4) Th. 2, p. 12. (5) Cor. 7, p. 18. (6) Th. 5
in (Singh and Sutton, 1996). (7) Cor. 3, p. 16. (8) Counterexamples: App. D.1, p. 35.

4 Comparison of Estimators: Experiments

In this section we make an empirical comparison of the estimators. We start with a
comparison using acyclic MRPs. For this case the ML estimator equals the MVU and
the MVU solution can be computed efficiently. This allows us to make a reasonable
comparison of the MVU/ML estimator with other estimators. In a second set of experiments
we compare the MVU with the ML estimator using a very simple cyclic MRP.
In a final set of experiments we compare the contraction properties of iML and TD(0).

4.1 Acyclic MRPs 

We performed three experiments for analyzing the estimators. In the first experiment
we measured the MSE as a function of the number of observed paths. In the second
experiment we analyzed how the MRP structure affects the estimation performance.
As we can see from Corollary 1, the performance difference between "MDP" based
estimators such as TD or ML and model-free estimators like MC depends on the ratio
between the number of sequences hitting a state s itself and the number of sequences
entering the subgraph of successor states without hitting s. We varied this ratio in the
second experiment and measured the MSE. The third experiment was constructed to
analyze the practical usefulness of the different estimators. We measured the MSE in
relation to the computation time.

Basic Experimental Setup We randomly generated acyclic MRPs for the experiments.
The generation process was the following: We started by defining a state s for which we
want to estimate the value. Then we randomly generated a graph of successor states.
We used different layers with a random number of states in each layer. Connections were
only allowed between adjacent layers. Given these constraints, the transition matrix
was generated randomly (uniform distribution). For the different experiments, a specific
number of starts in state s was defined. Besides that, a number of starts in other states
was defined. Starting states were all states in the first layers (typically the first 4).
Other layers which were further away from s were omitted, as paths starting in these
contribute little to the estimate but consume computation time. The distribution over
the starting states was chosen to be uniform. Finally, we randomly defined rewards for
the different transitions (between 0 and 1), while a small percentage (1 to 5 percent)
got a high reward (reward 1000). Apart from the reward definition, this class of MRPs
contains a wide range of acyclic MRPs. We tested the performance (empirical MSE)
of the ML, iML, MC and TD estimators. For the first two experiments the simulations
were repeated 300 000 times for each parameter setting. We split these runs into 30
blocks with 10 000 examples each and calculated the mean and standard deviation for
these. In the third experiment we only calculated the mean using 10 000 examples. We
used the modified TD(0) version, which is unbiased, with a learning rate of 1/i for each
state. The ML solution was computed at the end and not at each run. This means no
intermediate estimates were available, which can be a drawback. We also calculated the
standard TD(0) estimates. The difference to the modified TD(0) version is marginal
and therefore we did not include the results in the plots.
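The generation process described above could be sketched as follows. This is not the authors' code: the function names, the layer-size range, the default fraction of high rewards and the interpretation of "uniform distribution" as normalized uniform weights are all assumptions made for illustration.

```python
import random


def random_layered_mrp(n_layers, max_states_per_layer, high_reward_fraction=0.03, seed=0):
    """Random acyclic MRP: state 0 (the estimated state s) on top of `n_layers` successor layers."""
    rng = random.Random(seed)
    layers, next_id = [[0]], 1
    for _ in range(n_layers):
        size = rng.randint(1, max_states_per_layer)      # random number of states per layer
        layers.append(list(range(next_id, next_id + size)))
        next_id += size
    P, R = {}, {}
    for upper, lower in zip(layers, layers[1:]):          # connections only between adjacent layers
        for s in upper:
            weights = [rng.random() for _ in lower]       # uniform weights, normalized below
            total = sum(weights)
            P[s] = {t: w / total for t, w in zip(lower, weights)}
            for t in lower:
                high = rng.random() < high_reward_fraction
                R[(s, t)] = 1000.0 if high else rng.random()   # mostly rewards in [0, 1)
    return layers, P, R


def sample_path(P, start, rng):
    """Sample one path from `start` until a state in the last layer is reached."""
    path = [start]
    while path[-1] in P:                                   # last-layer states have no successors
        successors = P[path[-1]]
        path.append(rng.choices(list(successors), weights=list(successors.values()))[0])
    return path


layers, P, R = random_layered_mrp(n_layers=5, max_states_per_layer=4)
print(sample_path(P, 0, random.Random(1)))
```

Paths that start in deeper layers, as used in the experiments, can be sampled by passing a state from `layers[1]`, `layers[2]`, and so on as `start`.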




[Figure 3: plot not reproduced. x-axis: number of observed paths (10 to 50).]

Fig. 3 MSE of ML, iML, TD(0) and MC in relation to the number of observed paths. The
state space consisted of 10 layers with 20 states per layer.



4.1.1 Experiment 1: MSE in Relation to the Number of Observed Paths

In the first experiment, we analyzed the effect of the number of observed paths given
a fixed rate of $p_s = 0.2$ for starts in state s. The starting probability for state s is high
and beneficial to MC (the effect of $p_s$ is analyzed in the second experiment). Apart
from ML, all three estimators perform quite similarly with a small advantage for iML
and MC (Figure 3). ML is strongly superior even for few paths and the estimate is
already good for 10 paths. Note that, due to the scale, the improvement of ML is hard
to observe.



4.1.2 Experiment 2: MSE in Relation to the Starting Probability

In the second experiment we tested how strongly the different estimators use the
Markov structure. To do so, we varied the ratio of starts in state s (the estimated
state) to starts in the subgraph. The paths which start in the subgraph can only
improve the estimation quality of state s if the Markov structure is used. Figure 4 shows
the results of the simulations. The x-axis gives the number of starts in the subgraph
while the number of starts in state s was set to 10. We increased the number
exponentially. The exponential factor is printed on the x-axis; x = 0 is equivalent to always
starting in s. One can see that the MC and ML estimators are equivalent if in each run the
path starts in s. Furthermore, for this case MC outperforms TD due to the weighting
problem of TD (Section 3.6.2). Finally, TD, iML and ML make strong use of paths
which do not visit state s itself. Therefore, TD becomes superior to MC for a higher
number of paths. The initial plateau for the TD estimator appeared in the modified
and the standard version. We assume that it is an effect of the one-step error
propagation of TD(0). For the one-step error propagation a path starting in a state s' in
the ith layer can only improve the estimate if i paths are observed that span the gap
between s' and s. The probability of such an event is initially very small but increases
with more paths.

[Figure 4 plot; x-axis: "Extra" Paths ($2^k$), ticks 2 to 8]

Fig. 4 MSE of ML, iML, TD(0) and MC in relation to the starting probability of the estimated
state. The state space consisted of 10 layers with 20 states per layer.



4.1.3 Experiment 3: MSE in Relation to Calculation Time

In many practical cases the convergence speed per sample is not the important measure. 
It is the convergence speed per time that is important. The time needed for reaching 
a specific MSE level consists of the MSE for a given number of paths, the costs to 
calculate the estimate from the sample, and the costs for generating the paths. We 
constructed an experiment to evaluate this relation (Figure 5). We first tested which 
estimator is superior if only the pure estimator computation time is regarded (left part). 
For this specific MRP the MC estimator converges fastest as a function of time. The
rate for starts in state s was 0.2, which is an advantage for MC; in practice this rate will
typically be much lower. The other three estimators seem to be more or less equivalent. In
the second plot a constant cost of 1 was introduced for each path. Through this the 
pure computation time becomes less important while the needed number of paths for 
reaching a specific MSE level becomes relevant. As ML needs only very few paths, it 
becomes superior to the other estimators. Further, iML catches up on MC. For higher 
costs the estimators will be drawn further apart from ML (indicated by the arrow). 
The simulations suggest that MC or TD (depending on the MRP) are a good choice if
the path costs are low. For higher costs ML and iML are alternatives.

[Figure 5: two plots (left and right); x-axis of each: Calculation Time]

Fig. 5 MSE in relation to the computation time of the ML, iML, TD(0) and MC estimator. 
The left plot shows pure computation time (we excluded computation time for MRP calcula- 
tions like state changes). In the right plot, an extra factor for each observed path is included 
(one second per path). The state space consisted of 10 layers with 20 states per layer. We 
tracked for a given number of paths (ML: 10-50, iML, TD(0), MC: 10-1000) the MSE and the 
computation time. The plot was constructed with the mean values for every number of paths. 




[Figure 6: two plots, A (x-axis: $\gamma$) and B (x-axis: Paths)]

Fig. 6 A: The plot shows the difference in MSE between the ML estimator and the MVU
(MSE(ML) - MSE(MVU)) for 10 paths and different values of $\gamma$ and p. In the top right part
the MVU is superior and in the remaining part the ML estimator. B: The plot shows the MSE
of the ML, the MVU and the MC estimator and the bias of ML in dependence of the number
of paths for $p = \gamma = 0.9$. 30 000 samples were used for the mean and the standard deviation
(30 blocks with 1000 examples).



4.2 Cyclic MRPs: MVU - ML Comparison 



Calculating the MVU is infeasible without some algebraic rearrangements. Yet, the 
algebraic rearrangements get tricky, even for simple MRPs. We therefore restrict the 
comparison of the MVU and the ML estimator to the simplest possible cyclic MRP, 
i.e. the MRP from Figure 2 (A) with rewards R n = 1 and R u = 0. The MC and ML 
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value estimates are 



1 " * . 1 1 " jiu+l 



n t—' 1-7 n ^ 1 

u=l j=o 



and 



1-7 1 - 7 1 - 7p 

where $i_u$ denotes the number of times the cycle has been taken in run $u$. The MVU
sums the MC estimates over all consistent sets, i.e. over all vectors $(k_1, \ldots, k_n)$ which
fulfill $\sum_{u=1}^{n} k_u = s := \sum_{u=1}^{n} i_u$. Let the normalization $N$ be the size of this set and
$MC(k_j)$ be the MC estimate for $k_j$ cycles. The MVU is given by

$$\frac{1}{nN} \sum_{(k_1, \ldots, k_n) = s} \big( MC(k_1) + \ldots + MC(k_n) \big)
= \frac{1}{nN} \sum_{k_1=0}^{s} \cdots \sum_{k_{n-1}=0}^{s - k_1 - \ldots - k_{n-2}} \big( MC(k_1) + \ldots + MC(k_n) \big),$$

where in the second line $k_n = s - k_1 - \ldots - k_{n-1}$. The number of times $k_u$ takes a value
$j$ is independent of $u$, i.e. $MC(k_1)$ appears equally often as $MC(k_j)$ if $k_1 = k_j$. Hence,
it is enough to consider $MC(k_1)$ and the MVU is

$$\frac{1}{N} \sum_{k_1=0}^{s} MC(k_1) \sum_{k_2=0}^{s-k_1} \cdots \sum_{k_{n-1}=0}^{s - k_1 - \ldots - k_{n-2}} 1
=: \frac{1}{N} \sum_{k_1=0}^{s} C(k_1)\, MC(k_1).$$

Finally, the coefficient is $C(k_1) = \binom{s+n-2-k_1}{n-2}$ and the normalization is $N = \binom{s+n-1}{n-1}$.
The derivation can be done in the following way. First, observe that $1 = \binom{k_{n-1}}{0}$. Then,
that $\sum_{k_{n-1}=0}^{s-k_1-\ldots-k_{n-2}} \binom{k_{n-1}}{0} = \binom{1+(s-k_1-\ldots-k_{n-2})}{1}$ (e.g. rule 9 in (Aigner, 2006) p. 13). And
finally that $\sum_{k_{n-2}=0}^{s-k_1-\ldots-k_{n-3}} \binom{1+(s-k_1-\ldots-k_{n-2})}{1} = \sum_{j=0}^{s-k_1-\ldots-k_{n-3}} \binom{1+j}{1} = \binom{2+(s-k_1-\ldots-k_{n-3})}{2}$. Iterating the
steps leads to the normalization and the coefficients. In summary the MVU is

$$\frac{1}{N} \sum_{k_1=0}^{s} \binom{s+n-2-k_1}{n-2} MC(k_1). \qquad (12)$$
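
The counting argument above can be checked numerically on small instances. The following sketch compares a brute-force average over all consistent cycle vectors with the binomial closed form. It is an illustration under stated assumptions: the function names are hypothetical, and `discounted_cycle_return` assumes that the MC estimate for a path with k cycles is the discounted sum of a reward of 1 per cycle (any other per-path return function can be passed instead, since the identity does not depend on its form).

```python
from itertools import product
from math import comb
from typing import Callable, List


def discounted_cycle_return(k: int, gamma: float) -> float:
    """Assumed per-path MC return: reward 1 for each of the k cycles."""
    return sum(gamma ** j for j in range(k))


def mvu_brute_force(cycles: List[int], mc: Callable[[int], float]) -> float:
    """Average the MC estimate over all vectors (k_1, ..., k_n) that have
    the same total number of cycles as the observed vector."""
    n, s = len(cycles), sum(cycles)
    consistent = [k for k in product(range(s + 1), repeat=n) if sum(k) == s]
    per_vector = [sum(mc(k_u) for k_u in k) / n for k in consistent]
    return sum(per_vector) / len(consistent)


def mvu_closed_form(cycles: List[int], mc: Callable[[int], float]) -> float:
    """Closed form with binomial coefficients and normalization N."""
    n, s = len(cycles), sum(cycles)
    if n == 1:
        # Only one consistent vector exists, so the MVU equals the MC estimate.
        return mc(s)
    normalization = comb(s + n - 1, n - 1)
    weighted = sum(comb(s + n - 2 - k, n - 2) * mc(k) for k in range(s + 1))
    return weighted / normalization
```

The brute-force version enumerates $(s+1)^n$ candidate vectors and is only meant for checking small cases; the closed form needs $s+1$ terms.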

We compared the MVU to the ML estimator in two experiments. The results are
shown in Figure 6. One can observe in Figure 6 (A) that high probabilities for cycles
are beneficial for ML and that the discount which is most beneficial to ML depends
on the probability for the cycle. We have seen in Section 3.4.4 that the Bellman
equation forces the estimator to use all cycle times from 0 to $\infty$ and thus in a sense
"overestimates" the effect of the cycle. Furthermore, the probability for the cycle is
underestimated by ML, i.e. $E[\hat p] < p$ (Section 3.4.4), which can be seen as a correction
for the "overestimate". The parameter estimate is independent of the true probability
and the discount. Therefore, a parameter must exist which is most beneficial for ML,
i.e. ML is biased towards this parameter. The experiment suggests that the most
beneficial parameter p is close to 1, meaning that ML is biased towards systems with high
probabilities for cycles.

In Figure 6 (B) the results of the second experiment are shown. In this experiment
$\gamma = p = 0.9$ and the number of paths is varied. One can observe that the difference
between the ML and the MVU estimator is marginal in comparison to the difference
to the MC estimator. Furthermore, the bias of ML quickly approaches 0 and the
MVU and the ML estimator become even more similar.

[Figure 7: three plots titled "ML Contraction", "TD Contraction" and "Random Update"; x-axis ticks 10 to 40]

Fig. 7 The three plots show the contraction rate of different operators to the ML solution. The
x-axis denotes the number of applications of the operators and the y-axis shows the distance to
the ML solution. The left and the center plot are normalized by the initial distance (before the
first application). Left: The Bellman operator is used. The discount $\gamma$ varies from 0.3 to 0.9.
For each discount value the empirical distance and the bound (dotted line) is plotted. Center:
Same setting as in the left plot but with the "TD(0)" operator. Right: In this plot $\gamma = 0.9$.
The ML curve corresponds again to the Bellman operator. For the other three curves only
single states are updated with the Bellman operator, whereas the states which are updated are
chosen randomly. The percent values denote the deviation from the uniform prior for the states
(0% means uniform). For the single state curves not one update was performed per iteration
but $|S|$ many.



4.3 Contraction: ML, iML and TD(0) 

In a final set of experiments we compared the contraction factor of different operators.
We randomly generated transition matrices for a state space size of 100 and applied
the different operators. The results are shown in Figure 7. The left plot shows the
results for the usual Bellman operator and the bound for different discount values.
In the middle the TD(0) update equation is used and in the right plot the Bellman
operator is applied state-wise, whereas the state is chosen randomly from different
priors. The prior probabilities for states $1, \ldots, n := |S|$ are given by: $p_1 = (1 - c)m$, $p_2 =
(1 - c + 1/(n - 1))m$, $\ldots$, $p_n = (1 + c)m$, where $m = 1/n$ (mean) and c denotes the
deviation from the uniform prior. If c = 0 then we have a uniform distribution. If
c = 0.1 then $p_1 = 0.9m$, $p_2 = (0.9 + 1/(n - 1))m$, $\ldots$, $p_n = 1.1m$.

While we were not able to prove that TD is in general inferior to ML, respectively to
iML, the plots suggest this to be the case for typical MRPs. In particular, the contraction of
TD (middle plot) to the ML solution is orders of magnitude slower than the contraction using
the Bellman operator. The state-wise update reduces the contraction speed further.
The right plot shows the difference between the fixed point iteration and the state-wise
update with the Bellman operator (corresponding to iML). The contraction factor of
the state-wise update depends crucially on the distribution of visits of the different
states. At best (i.e. uniform distribution over the states) the contraction is about $|S|$
times slower than the contraction with the Bellman operator applied to the full state
vector.
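
The left plot of Figure 7 can be imitated on a small scale with the following sketch. It is a simplified illustration, not the code behind the figure: all function names are hypothetical, the transition matrix is drawn by normalizing uniform random weights (an assumption about what "randomly generated" means), the fixed point is approximated by iterating the operator, and distances are measured in the supremum norm, for which the Bellman operator contracts with factor $\gamma$.

```python
import random
from typing import List

Matrix = List[List[float]]


def random_transition_matrix(n: int, rng: random.Random) -> Matrix:
    """Row-stochastic matrix obtained by normalizing uniform random weights."""
    rows = []
    for _ in range(n):
        weights = [rng.random() for _ in range(n)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def bellman(v: List[float], p: Matrix, r: List[float], gamma: float) -> List[float]:
    """One application of the Bellman operator: r + gamma * P v."""
    return [
        r[s] + gamma * sum(p_st * v_t for p_st, v_t in zip(p[s], v))
        for s in range(len(v))
    ]


def sup_distance(a: List[float], b: List[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def contraction_curve(p: Matrix, r: List[float], gamma: float,
                      steps: int, rng: random.Random) -> List[float]:
    """Distance to the fixed point after 1..steps applications of the
    operator, normalized by the initial distance."""
    fixed = [0.0] * len(r)
    for _ in range(1000):  # approximate the fixed point by iteration
        fixed = bellman(fixed, p, r, gamma)
    v = [rng.uniform(-10.0, 10.0) for _ in r]
    initial = sup_distance(v, fixed)
    curve = []
    for _ in range(steps):
        v = bellman(v, p, r, gamma)
        curve.append(sup_distance(v, fixed) / initial)
    return curve
```

After k applications the normalized distance is bounded by $\gamma^k$, which corresponds to the dotted bound in the figure. The state-wise and TD(0) variants are not covered by this sketch.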

5 Summary 

In this work we derived the MVU and compared it to different value estimators. In 
particular, we analyzed the relation between the MVU and the ML estimator. It turned 
out that the relation between these estimators is directly linked to the relation between 
the class of unbiased estimators and Bellman estimators. If the ML estimator is unbi- 
ased then it is equivalent to the MVU and more generally the difference between the 
estimators depends on the bias of ML. This relation is interesting, in particular as the 
estimators are based on two very different algorithms and proving equivalence using
combinatorial arguments is a challenging task. Furthermore, we demonstrated in this 
paper that the MC estimator is equivalent to the MVU in the undiscounted case if 
both estimators have the same amount of information. The relation to TD is harder 
to characterize. TD is essentially unbiased in the acyclic case and therefore inferior to 
the MVU and the ML estimator in this case. In the cyclic case TD is biased and our 
tools are not applicable. 

We want to conclude the section with open problems. Possibly, the most interesting
problem is the derivation of an efficient MVU algorithm. The combinatorial problems
that must be solved appear to be formidable. Therefore, it is astonishing that in the
undiscounted case the calculation essentially boils down to calculating the ML
estimate. In particular, the exponential runtime of a brute force MVU algorithm, which
is intractable even for simple MRPs, decreases in this case to an $O(n^3)$ factor. This
efficiency is mainly due to the irrelevance of the time at which a reward is observed.
In the discounted case the time of an observation matters and the algorithmic
difficulties increase considerably. Instead of the full geometric series of ML with arbitrarily
long paths it seems necessary to make a cutoff at a maximum number of cycles,
i.e. replacing $(I - \gamma P)^{-1}$ with something like $(I - \gamma P^s)(I - \gamma P)^{-1}$. Yet, Equation 12
shows that a weighting factor is associated with each time step and the MVU equation
is not that simple.

Another interesting question concerns the bias of the ML estimator. We showed
that the normalizations $\{N_i \ge 1\}$ are the reason for the bias. Furthermore, if the Full
Information Criterion applies then the normalization problem is not present and we
used a theorem from (Singh and Sutton, 1996) to deduce unbiasedness of ML for this
case. Yet, there seems to be a deeper reason for the unbiasedness of the ML estimator
and the theorem from (Singh and Sutton, 1996) appears to be an implication from this
and from Corollary 1 (MVU=MC).

5.1 Discussion 

In the discussion section we address two questions: (1) What is the convergence speed
of the MVU? (2) Which estimator is to be preferred in which setting? In this section
the emphasis is put on gaining intuition and not on mathematical rigor.

Convergence Speed We are interested in the MSE and in the small deviation probability
of the MVU. First, let us state the variance and the Bernstein inequality (e.g. (Lugosi,
2006)) for the first-visit MC estimator with n paths available for estimation:

$$V[\hat V^{(n)}] = MSE[\hat V^{(n)}] = \frac{1}{n} V[R]$$

and

$$P\big(|\hat V^{(n)} - V| > \epsilon\big) \le 2 \exp\left( - \frac{n \epsilon^2}{2 V[R] + 2 d \epsilon / 3} \right),$$

where $V[R]$ is the variance in the cumulated reward (see (Sobel, 1982) for the variance
of a MRP) and d is an upper bound for the cumulated reward of any path, i.e. $|\sum_t R_t -
V| \le d$.

How about the MVU? In the undiscounted case the MVU has the same variance
and small deviation probability if the Full Information Criterion applies. The quality
increases with further paths into the graph of successor states. Intuitively, the
improvement in quality depends on the "distance" of the entry point s' in the successor
state graph to the state s of which we want to estimate the value. A natural distance
measure for this setting is the probability to move from state s to s'. Furthermore,
the improvement will depend on the variation in the cumulative reward of paths
starting in s'. Paths that run through regions in which the reward has high variance will
yield a better performance increase than paths which run through near deterministic
regions. The performance will, however, be lower bounded by the case that all of these
N paths start directly in s. Therefore, for undiscounted MRPs the rough lower bound
$(1/N) V[R]$ will hold:

$$\frac{1}{N} V[R] \le MSE\big[E[\hat V^{(n)} | S]\big] \le \frac{1}{n} V[R].$$

If starts in the successor graph are c times more often than starts in s, i.e. N = cn,
then

$$\frac{1}{c} MSE[\hat V^{(n)}] \approx MSE\big[E[\hat V^{(n)} | S]\big].$$

Similarly, a "reasonable" Bernstein bound of the small deviation probability will lie
between

$$2 \exp\left( - \frac{N \epsilon^2}{2 V[R] + 2 d \epsilon / 3} \right) \quad \text{and} \quad 2 \exp\left( - \frac{n \epsilon^2}{2 V[R] + 2 d \epsilon / 3} \right).$$
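
To get a feeling for the magnitudes involved, the rough bounds of this paragraph can be evaluated numerically. The sketch below is only a calculator for the expressions discussed above, assuming the standard form of the Bernstein inequality with exponent $-n\epsilon^2 / (2V[R] + 2d\epsilon/3)$; the function names and the example numbers are hypothetical.

```python
import math
from typing import Tuple


def mc_mse(var_return: float, n: int) -> float:
    """MSE of the first-visit MC estimator with n paths: V[R] / n."""
    return var_return / n


def mvu_mse_range(var_return: float, n_start: int, n_total: int) -> Tuple[float, float]:
    """Rough range for the MSE of the MVU in the undiscounted case:
    between V[R] / N (all N paths start in s) and V[R] / n."""
    return var_return / n_total, var_return / n_start


def bernstein_bound(n: int, eps: float, var_return: float, d: float) -> float:
    """Bernstein bound on the probability of a deviation larger than eps."""
    exponent = -n * eps ** 2 / (2.0 * var_return + 2.0 * d * eps / 3.0)
    return 2.0 * math.exp(exponent)
```

For example, with `var_return=4.0`, 10 starts in s and 50 paths in total, `mvu_mse_range` gives the interval from 0.08 to 0.4.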

Choosing an Estimator Our study shows that we have essentially a tradeoff between
computation time and convergence speed per sample. As one would expect, the methods
which converge faster have a higher computation time. It seems that the fast methods
with poor convergence speed per sample are superior if we consider pure computation time
(Experiment 3, Section 4.1.3). However, if there are costs involved for producing examples,
then the expensive methods become competitive. In a high cost scenario it currently
seems best to choose the ML/LSTD estimator. The MVU might become an alternative,
but an efficient algorithm is currently missing. Furthermore, the algorithmic problems
restricted the numerical comparison to ML and it is unclear in which setting which
estimator is superior.

A Notation



$\mu_{ss'}$                          The number of direct transitions from state s to s'.
$\pi$                                Path.
$\pi_i$                              ith state in the path.
$\Pi_{ss'}$                          Set of all paths from state s to s'.
$\Pi(\mathbb{S})$                    Set of paths that are consistent with $\mathbb{S}$.
$P, E, V$                            Probability measure, expectation and variance.
$R_s$                                Sum of the reward received through transitions from state s.
$K_s$                                Number of visits of state s.
$\hat p_{ss'}$                       Estimate of the probability for a direct transition from state s to s'.
$\hat P_{ss'}$                       Estimate of the transition probability from state s to s'.
$\hat R_s$                           Estimator of the reward received through transitions from state s.
$\mathbb{S}$                         State space.
$S, T$                               Sufficient Statistics.
$V_s$                                True value of state s.
$\hat V_s$                           Estimated value of state s. Concrete estimator is section dependent.
$\hat V_s^{(i)}, \hat p_{ss'}^{(i)}, \ldots$   Superscripts denote values after the ith run.
$\mathbf{V}, \mathbf{P}, \ldots$     Vectors and Matrices.



B Unbiased TD($\lambda$)

In this section we introduce a slightly modified TD($\lambda$) estimator. The estimates are, in
contrast to the standard TD($\lambda$) estimator, independent of the initialization. In the acyclic case
this is already enough to guarantee unbiasedness of TD($\lambda$). We first discuss the TD(0) case.
This case contains the major arguments in an accessible form.



B.1 TD(0)

We first restate the TD(0) equation through unfolding the recursive definition (eq. 3, p. 6).
Lemma 3 If the TD(0) estimator is initialized with 0 then for an acyclic MRP it equals

$$\hat V_s^{(n)} = \sum_{i=1}^{n} \beta_i R_s^{(i)} + \sum_{s' \in \mathbb{S}} \Big( \sum_{i=1}^{n} \beta_i \gamma T_{ss'}^{(i)} \hat V_{s'}^{(i-1)} \Big),$$

where $\beta_i := \alpha_i \prod_{j=i+1}^{n} (1 - \alpha_j)$, $R_s^{(i)}$ is the received reward in path i and $T_{ss'}^{(i)}$ is a random
variable which is one if in run i the state s' followed upon state s and is zero otherwise.

Proof The recursive TD(0) definition (eq. 3) can be written as: $\hat V_s^{(n)} = \hat V_s^{(n-1)}(1 - \alpha_n) +
\alpha_n (R^{(n)} + \gamma \hat V_{s'}^{(n-1)})$. Substituting $\hat V_s^{(n-1)}$:

$$\hat V_s^{(n)} = \big( \hat V_s^{(n-2)} (1 - \alpha_{n-1}) + \alpha_{n-1} (R^{(n-1)} + \gamma \hat V_{s'}^{(n-2)}) \big)(1 - \alpha_n) + \alpha_n (R^{(n)} + \gamma \hat V_{s'}^{(n-1)})$$
$$= \hat V_s^{(n-2)} (1 - \alpha_{n-1})(1 - \alpha_n) + \alpha_{n-1}(1 - \alpha_n)(R^{(n-1)} + \gamma \hat V_{s'}^{(n-2)}) + \alpha_n (R^{(n)} + \gamma \hat V_{s'}^{(n-1)})$$
$$= \ldots = \sum_{i=1}^{n} \Big( \alpha_i \prod_{j=i+1}^{n} (1 - \alpha_j) \Big) (R^{(i)} + \gamma \hat V_{s'}^{(i-1)}) =: \sum_{i=1}^{n} \beta_i (R^{(i)} + \gamma \hat V_{s'}^{(i-1)}).$$

The estimate contains the initial values $\hat V^{(0)}$, which bias the estimator towards the initialization. The
estimator can be made unbiased for acyclic MRPs by excluding these values and by guaranteeing
that the $\beta_i$ sum to one. Modification 1 does exactly this (see footnote 2).

2 The modification is easy to implement with a TD($\lambda$) algorithm. $\lambda$ is set to 0 if the successor
state is initialized and it is set to 1 if the successor state is not. Further, the initial learning
rate must be 1.
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Modification 1 Modified TD(0) 

if $\hat V_s$ has seen no example then
    set the learning rate for this step to 1.
end if
if $\hat V_{s'}$ has seen no example then
    first update the estimate $\hat V_{s'}^{(i-1)}$.
end if
Use the TD(0) update rule (3).



Setting the learning rate for the first example to 1 eliminates the initialization of $\hat V_s$.
The second rule assures that the initialization of the estimators of the successor states is
eliminated. Setting the learning rate $\alpha_1$ to 1 has also the effect that the weighting factors
$\beta_i$ sum to one, independent of the learning rate. For example for n = 3, we have $\sum_{i=1}^{3} \beta_i =
1 (1 - \alpha_2)(1 - \alpha_3) + \alpha_2 (1 - \alpha_3) + \alpha_3 = 1$.
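
Modification 1 and the weighting factors can be sketched in code as follows. This is one possible reading of the modification for acyclic MRPs, not a reference implementation: the class and function names, the representation of a path as a list of (state, reward, successor) triples, the terminal label `"T"` and the default learning rate 1/k are all assumptions made for the illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

Transition = Tuple[str, float, str]  # (state, reward, successor state)
TERMINAL = "T"  # hypothetical label of the terminal state (value 0)


def beta_weights(alphas: List[float]) -> List[float]:
    """beta_i = alpha_i * prod_{j > i} (1 - alpha_j)."""
    betas = []
    for i, alpha in enumerate(alphas):
        weight = alpha
        for later in alphas[i + 1:]:
            weight *= 1.0 - later
        betas.append(weight)
    return betas


class ModifiedTD0:
    """TD(0) with the two rules of Modification 1 (acyclic MRPs only)."""

    def __init__(self, gamma: float,
                 learning_rate: Callable[[int], float] = lambda k: 1.0 / k) -> None:
        self.gamma = gamma
        self.learning_rate = learning_rate
        self.value: Dict[str, float] = {}
        self.count: Dict[str, int] = {}

    def _estimate(self, state: str) -> float:
        return 0.0 if state == TERMINAL else self.value.get(state, 0.0)

    def _update(self, path: List[Transition], idx: int, done: Set[int]) -> None:
        state, reward, successor = path[idx]
        # Rule 2: a successor without any example is updated first.
        if idx + 1 < len(path) and self.count.get(successor, 0) == 0:
            self._update(path, idx + 1, done)
        k = self.count.get(state, 0) + 1
        # Rule 1: the first example of a state uses learning rate 1.
        alpha = 1.0 if k == 1 else self.learning_rate(k)
        target = reward + self.gamma * self._estimate(successor)
        self.value[state] = (1.0 - alpha) * self._estimate(state) + alpha * target
        self.count[state] = k
        done.add(idx)

    def observe(self, path: List[Transition]) -> None:
        """Process one path; each transition is used exactly once."""
        done: Set[int] = set()
        for idx in range(len(path)):
            if idx not in done:
                self._update(path, idx, done)
```

For the chain A -> B -> T with rewards 1 and 2 and $\gamma = 0.5$, a single observed path already yields the estimate 2 for both states, whereas standard TD(0) initialized with 0 would give 1 for state A after the first path.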

Theorem 7 The modified TD(0) estimator is unbiased if the MRP is acyclic.

Proof We prove this by induction. We start with the terminal states for which $\hat V_s = 0 = V_s$
holds. The induction step then considers the states which have only successors that have already
been handled. This way the complete state space will be addressed. The expectation has the
form (Lemma 3):

n n 

i=l s'es i=l 

n n 

i— 1 s'eSi — 1 

n n 

i— 1 s'eS i— 1 

It remains to show that $E[\hat V_{s'}^{(i-1)} | K_s = n]$ is unbiased. For $i \ge 2$ this follows from the
induction hypothesis. For the case i = 1 Modification 1 guarantees that the estimator has at
least one example for estimation and is unbiased due to the induction hypothesis. Furthermore,
the $\beta_i$'s sum to one due to the modification and $\sum_{s' \in \mathbb{S}} p_{ss'} \gamma V_{s'} \sum_{i=1}^{n} \beta_i = \gamma \sum_{s' \in \mathbb{S}} p_{ss'} V_{s'}$.




B.2 TD($\lambda$)

The TD($\lambda$) case is essentially the same. The main difference is that the estimates of all states
of a path are used. Therefore, it is not enough that the estimators of the direct successor states
are set to "reasonable" values, but all states of the path must be:



Modification 2 Modified TD($\lambda$)

if $\hat V_s$ has seen no example then
    set the learning rate for this step to 1.
end if
if for a successor s' in the path $\hat V_{s'}$ has seen no example then
    first update the estimate $\hat V_{s'}$.
end if
Use the TD($\lambda$) update rule.

Theorem 8 The modified TD($\lambda$) estimator is unbiased if the MRP is acyclic.

Proof Proof by induction. Induction Hypothesis: $E[\hat V_s | K_s = n] = V_s$.
Induction Basis: For terminal states the hypothesis trivially holds.

Induction Step: Let n(i,j) be state i in path j and let R^ti^) De * ne reward received at 
state j in run i. In the acyclic case TD(A) can be written as 



V} n > = (1 - a^v}"-^ + a n ( ^( 7 A) 1 - 1 ^ ( ,,„ ) + 7 (1 - A) ^( 7 A) 1 " 2 9 <i<n) 

\t=l i=2 

(1 - ai )V s W + f> fe(^) <-1 ^«,i) +7(1 - A)E(7A) i - 2 K 



/■(«-!) 



3=1 \i=l 



T (*.i) 



We suppressed the "iteration" index of V^^ ^ for readability. Like in the TD(0) case /3j 
\ a j llfc=j+i(l ~ a k)j ■ Applying the expectation operator and using ct\ = 1, we get 



E[V s {n) \K s = n] = E 



E ft E(7^)'" 1 ^(i, J ) + 7(1 ~ A) E(7A) i - 2 V; (! ,„) 

3=1 \»=1 !=2 



K s = n 



E ft E(7A) 1 - 1 E[fi 7r(l ,, ) |i<' s = n] + 7(1 - A) Y^W~ 2 ®\?*m\ K ' = n 

j = l \i=l i=2 



(13) 



Instead of E[ii^(jj)] and E[V^.(j j-j] we use E[f?i] and E[Vi] in the following to denote the 
expected reward in step i, respectively the expected value estimate in step i (expected state 
times expected value estimate for that state). Due to the induction hypothesis 

E[Vi\K 3 =n] = V l =Ey _i E[iij-]. 

j=i 

Substituting this term into equation 13: 



E ft E(7A) i " 1 E[R l ] + 7(1 - A) E(7A) 1 - 2 E7 i_i E[i? j 
3=1 \»=1 i=2 j=i 

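Taking a specific $E[R_i]$, we see that for the coefficient

$$(\gamma\lambda)^{i-1} + \gamma(1 - \lambda)\big(\gamma^{i-2}\lambda^{i-2} + \ldots + \gamma^{i-2}\lambda + \gamma^{i-2}\big)
= \gamma^{i-1}\Big(\lambda^{i-1} + (1 - \lambda)\sum_{j=0}^{i-2} \lambda^j\Big)
= \gamma^{i-1}\Big(\lambda^{i-1} + (1 - \lambda)\frac{1 - \lambda^{i-1}}{1 - \lambda}\Big) = \gamma^{i-1}$$

holds. We know already that the $\beta_j$ sum to one. Hence, the modified TD($\lambda$) is unbiased.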
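
The coefficient identity used in the last step can be verified numerically. The sketch below collects, for the expected reward in step i, the direct contribution $(\gamma\lambda)^{i-1}$ and the contributions that enter through the value terms of steps 2 to i. The function name is hypothetical and the decomposition follows the argument of the proof as read here.

```python
def reward_coefficient(i: int, gamma: float, lam: float) -> float:
    """Total weight of the expected reward of step i in the TD(lambda) target.

    The value term of step k (2 <= k <= i) carries the weight
    gamma * (1 - lam) * (gamma * lam)**(k - 2) and contains the reward of
    step i discounted by gamma**(i - k).
    """
    direct = (gamma * lam) ** (i - 1)
    through_values = sum(
        gamma * (1.0 - lam) * (gamma * lam) ** (k - 2) * gamma ** (i - k)
        for k in range(2, i + 1)
    )
    return direct + through_values
```

For every choice of $\lambda$ the result equals $\gamma^{i-1}$, i.e. the weight the reward of step i has in the true value.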



C Proofs 



C.1 Unbiased Estimators - Bellman Equation



Lemma 1 For the MRP from Figure 2 (B) there exists no parameter estimator $\hat p$ such that
$\hat V_i(\hat p)$ is unbiased for all states i.

Proof Assume that $\hat V_1$, $\hat V_2$ are unbiased, i.e. $E[\hat V_1 | \{N_1 \ge 1\}] = E[\hat V_2] = V_1 = V_2$, and the
estimator fulfills the Bellman equation, i.e. $\hat V_1 = \hat V_2$ on $N_1 := \{N_1 \ge 1\}$. Then

$$E[\hat V_2 | N_1] \overset{\text{Bellm.}}{=} E[\hat V_1 | N_1] \overset{\text{unb.}}{=} V_1 \overset{\text{Bellm.}}{=} V_2 \overset{\text{unb.}}{=} E[\hat V_2].$$

We used for the first equality that $\hat p_{12}$ must equal 1 as there is only one connection leading
away from state 1. The derived equality shows that the average value estimate $\hat V_2$ must be the
same as the average estimate for the case that the connection 2 → 1 has been taken at least
once. This implies that the value estimate for the case that only the connection 2 → 3 has
been taken must be the same as the average value estimate:

$$E[\hat V_2] = E[\hat V_2 | N_1] P[N_1] + E[\hat V_2 | N_1^c] P[N_1^c] = E[\hat V_2] P[N_1] + E[\hat V_2 | N_1^c] P[N_1^c],$$

where $N_1^c$ denotes the event $N_1 = 0$. This implies that $E[\hat V_2 | N_1] = E[\hat V_2] = E[\hat V_2 | N_1^c]$.

There are two possibilities to achieve this equality: (1) All three terms are 0. In particular,
$E[\hat V_2] = 0 \ne V_2$. This contradicts unbiasedness. (2) $E[\hat V_2 | N_1^c] \ne 0$: As $R_{23} = 0$ that means that
$\hat p_{21} \ne 0$, despite the fact that this transition has not been observed. Furthermore, this implies
that $\hat p_{12} = 1$ as otherwise no valid MRP is defined. As a consequence $\hat V_1 = \hat V_2$ on both the
events $N_1$ and $N_1^c$, in particular $E[\hat V_1 | N_1^c] = E[\hat V_2 | N_1^c] \ne 0$. Now, we get a contradiction with
the following argument:

$$V_1 + E[\hat V_1 | N_1^c] \overset{\text{unb.}}{=} E[\hat V_1 | N_1] + E[\hat V_1 | N_1^c] = E[\hat V_2 | N_1] + E[\hat V_2 | N_1^c] = E[\hat V_2] \overset{\text{unb.}}{=} V_2 \overset{\text{Bellm.}}{=} V_1,$$

as the equality can only hold if $E[\hat V_1 | N_1^c] = 0$.

Lemma 2 For the MRP from Figure 2 (A) and for n = 1 there exists no parameter
estimator $\hat p$ that is independent of $\gamma$ such that $\hat V(\hat p)$ is unbiased for all parameters p and all
discounts $\gamma$.

Proof For V(p) to be unbiased, it must hold that 



E 



c 1 -p)Yl^ ip 



a - P ) y, 7V ^ ^ 7 1 - p)f] - (1 - P )in = 0. 



If the equality holds for all 7 £ (0, 1), then E[(l — p)p'] = (1 — p)p l for i > 0. Otherwise, 
with Xi := E[(l — p)p t ] — (1 — p)p* and x n being the first term different from (|ain| > 0): 
\f n x n \ = lEfcU+i^il and therefore \x„\ = |7E£„+i 7 I_(n+1) ^|. We can now adjust the 
discount to downscale the right hand side arbitrary low, while the left side stays unaffected. 
The sequence \xi\ is bounded, i.e. \xi\ < max{E[(l — p)p l ],(l — p)p 1 } for all i. Futhermore, 
both terms are bounded by max ag [g — a)d l . The maximum is reached for a = i/(i + 1) 
and the maximal value over all i is reached for i = 0. The value for i = is 1. Therefore, 

00 00 

|7 £ 7 l - ( " +1) ^l<7 £ 7*- ( " +1) l = -->—. 
i=n+l i=n+l * ^' 

For 7 < — |x n |) the term and the remaining part of the sum becomes smaller than 

\xn\- |a:n|/(l/4 — \x n \) is always larger than as \x n \ > and for |a!n| — > 1 the discount 7 can 
be chosen arbitrary large. In summary this contradicts the assumption and E[(l — p)p l ] must 
equal (1 — p)p l for all i > 0. 

Therefore, E[l - p] = 1 - p => E[p] = p and E[(l - p)p] = (1 - p)p => E[p 2 ] = p 2 . 
Consequently, p must be a constant. Otherwise, we get a contradiction with the following 
argument: The possible values of p are countable (countable many outcomes). We denote the 
values with and with qi the probabilities for the values a{. From E[p 2 ] = E[p] 2 it follows 
that J2iloH a i = E£=o Ej=-0 H<lj<Haj => E£=o Ej^i <li<lj<HO,j = 0. Furthermore, q it ai > 
for all i and therefore qiqjaiaj = for all i 7^ j. As a consequence, there can be only one > 
with qi > 0. Because E[p] = p it holds that a; = p/gi and because E[p 2 ] = p 2 it holds that 
ai = pi sfqi. Hence, = 1 and the parameter estimate is almost surely a constant p. Hence, 
for a MRP with p = p/2 the estimator will not be unbiased. 



C.2 Markov Reward Process 



Lemma 4 A MRP with finite state space and iid sequences forms an s-dimensional exponen- 
tial family, where s is the number of free MRP parameters. 
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Proof Firstly, we demonstrate that the transition distribution forms an exponential family. 
The density can be written as 



$$P(X_1 = \pi^{(1)}, \ldots, X_n = \pi^{(n)}) = \prod_{i=1}^{\infty} P_{\pi_i}^{c_i} = \exp\Big( \sum_{i=1}^{\infty} c_i \log P_{\pi_i} \Big),$$

with $\pi^{(i)}$ being the observed paths, $(\pi_i)_{i \in \mathbb{N}}$ the set of paths, $c_i$ the number of times path i
has occurred and $P_\pi$ the probability of path $\pi$. The parameters $P_\pi$ are redundant. We explore
now the MRP structure to find natural parameters that are not functionally dependent. The
size of this set of parameters is the number of necessary MRP parameters, that is

$$\big( |\text{Starting States}| - 1 \big) + \sum_{i \in \mathbb{S}} \big( |\text{Direct Successors of } i| - 1 \big).$$

We reformulate the exponential expression to reduce the number of parameters. First, one 
can observe that P** i s equivalent to I~I;gs (p7' Tlj&iPij^ji where m is the number of 
starts in state i. The parameters are still redundant: Let state 1 be a starting state and S the 
remaining set of starting states, then pi = 1 — Ylj^sPj- Furthermore, we have one redundant 
parameter pij for every state i. The first problem can be overcome by using A(9) in the following 
way: n log (l — ^Zigs Pi) + Eigs n i 1°S 7 ~ r- Here, A(8) equals the n term and m is 

the number of starts in state i. Using the same approach for the transition parameters results in 

Ki log ( 1 - J2jcs(i) Pij ) + E^gs Mij lo S 7 — V' witn ^(*) being the set of successor 

v {i-22ues(i)Piu) 

states of i without the first successor. This time the Ki term cannot be moved into A(8), as 

Ki is data dependent. This problem can be overcome by observing that Ki = ni + Sjgg Vji 

and by splitting the Ki terms. As a result we get 

/ , /. r v~-> V v , w(i-E„ eS( i)W«) 

cxp n-log 1 - > ,pi 1 - 2 „ Pi" + / , n i lo S ■ 



„ „ Pij y- ~ E uS s(j)Pi«; 

2^ }^ ^ lo g 



• j£S(i) ^1 — EneS(i) PiuJ 



If the reward is deterministic and the examples consist of state sequences, then the MRP forms 
an exponential family. If the reward is a random variable then it depends on the distribution 
of this random variable. In many cases, like for the binomial or multinomial distribution, the 
resulting MRP still forms an exponential family. 



C.3 MVU 

Theorem 4 $E[\hat V | S]$ converges on average to the true value. Furthermore, it converges almost
surely if the MC value estimate is upper bounded by a random variable $Y \in L^1$.

Proof The estimator converges on average, because $E[|E[\hat V | S] - V|] \le E[|\hat V - V|] \to 0$ as $n \to \infty$, where
n denotes the number of observed paths and convergence follows from the MC convergence. The
inequality follows from the Rao-Blackwell Theorem, respectively from the Jensen inequality
because $|\cdot|$ is convex.

For almost sure convergence we need to show that

$$\lim_{n \to \infty} E[\hat V | S] = V \quad \text{a.s.},$$

where n denotes again the number of observed paths.

We use a statement from (Bauer and Burckel, 1995)[§15 Conditional Expectation, (15.14)],
which says that $\lim_{n \to \infty} E[\hat V | S] = V$ a.s. if $\hat V$ converges almost surely and if it is upper
bounded by a random variable $Y \in L^1$. The upper bound comes from the Lebesgue convergence
Theorem and the statement uses that for the conditional expectation it holds that $\lim \hat V =
V$ a.s. implies $E[\lim \hat V | S] = V$ a.s.
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C.4 ML Estimator 

Theorem 5 The ML estimator is unbiased if the MRP is acyclic. 
Proof The value function can be written as 

where $\Pi_{ss'}$ is the set of paths from s to s', $P_\pi$ the probability of path $\pi$, $|\pi|$ the length
of the path and $E[R_s] = \sum_{s' \in \mathbb{S}} p_{ss'} E[R_{ss'}]$. The ML estimator can be written in the same
form, whereas $P_\pi$ is replaced with $\hat P_\pi := \prod_i \hat p_{\pi_i \pi_{i+1}}$ and the expected reward with the reward
estimator. The sample mean estimator $\hat p$ is unbiased and the reward estimator is unbiased
because of our initial assumption. The main problem is to show that $\hat P_\pi$ is unbiased, i.e. that

$$E\Big[ \prod_i \hat p_{\pi_i \pi_{i+1}} \,\Big|\, K_s = n \Big] = \prod_i p_{\pi_i \pi_{i+1}}.$$

The last of these estimators (denote it with Pg~) is conditionally independent of the others 
given the number of visits of state s (Kg). This is also the main point where acyclicity is 
needed. Using this together with the law of total probability and the fact that p is unbiased, 
leads to the following statement (with L being the length of the path ir): 

L-l n L-l 

E [II P^i^i+i\ K " = n ] = IZ E [n P^i^i+i\ K ' = n > K e = l ] p l K s = = n] 
1=1 i=l 1=1 

7i L-2 

l =I^ E [n P*m+i\ K s =^,K S =i\p sS P[K s =l\K s =n] 

i = l i = l 

n L-2 L-2 

= PiiJ2 ]E [ll P"i«i+i\ K s = n > K s =l]v[K s = l\K B = n] =p 3 £&[Yl PTm+i\ K s = n ]- 

1 = 1 i = l i = l 

We used that for $l = 0$ the last estimator $\hat p$ in the product is zero. The procedure has to
be repeated for every $\hat p$. As a result the expectation of this estimator is equal to the path
probability. One can handle the reward estimator with the same procedure. In summary we
find that the value estimator is unbiased.



D Counterexamples 

D.l MC - TD 

We present two examples in this section. In the first example MC has a lower MSE than TD(0) 
and is at least as good as TD(A) for every A. In the second example TD(0) is superior to MC. 

D.l.l MC Superior to TD 

Figure 8 (A) shows an example for which the MC estimator is superior. We assume that the
learning rate $\alpha_i$ of TD(0) is between 0 and 1, that the learning rate in the first step is 1
($\alpha_1 = 1$) and that the estimator is initialized to 0 (we use this assumption for readability;
it is also possible to use the unbiased TD(0) estimator (Modification 1)). Let state 1 be the
starting state, $n$ be the number of observed paths and let $\gamma = 1$ for simplicity.

The MC estimator for state 2 is $\frac{1}{n} \sum_{i=1}^{n} Y_i$, where $Y_i = R_{23}$ or $Y_i = R_{24}$ are the rewards
received after a transition from state 2 to state 3 or 4. For state 1 we obtain $\frac{1}{n} \sum_{i=1}^{n} (Y_i + R_{12}^{(i)})$,
where the $Y_i$ are the same as before and $R_{12}^{(i)}$ is the received reward after a transition to state
2. The MC estimator is a weighted average of the examples and it is the optimal unbiased
linear estimator (eq. 2) as $\alpha_i = 1/n$ for all $i$.
[Figure 8, panels A and B; recoverable edge labels: $R_{34} = 1$, $R_{35} = -1$]




Fig. 8 A: An MRP for which TD is inferior to MC. The transition from state 1 to state 2 is
followed by a reward $R_{12} = +1$ and $R_{12} = -1$ with probability $p = 0.5$ each. B: An MRP for
which MC is inferior to TD. No reward is received for transitions 1 → 2 and 1 → 3. $p_1$ and $p_2$
are the probabilities to start in state 1 and 2.



We now analyze the TD(0) estimator. Consider two different sequences $\alpha_i$ and $\tilde\alpha_i$, $i =
1, \ldots, n$, of learning rates for the TD(0) estimators $V_1$ and $V_2$. The TD(0) estimator $V_2$ can
be written as (Lemma 3, Appendix B)

$$V_2^{(n)} = \sum_{i=1}^{n} \tilde\alpha_i \prod_{j=i+1}^{n} (1 - \tilde\alpha_j) \, Y_i =: \sum_{i=1}^{n} \tilde\beta_i Y_i.$$

The estimator is unbiased and has minimal variance if and only if $\tilde\beta_i = 1/n$. This can be
enforced by choosing $\tilde\alpha_i = 1/i$. For state 1 we obtain

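$$\begin{aligned}
V_1^{(n)} &= \sum_{i=1}^{n} \beta_i R_{12}^{(i)} + \sum_{i=1}^{n} \beta_i \sum_{j=1}^{i-1} \tilde\beta_j Y_j \\
&= \Big(\sum_{i=1}^{n} \beta_i R_{12}^{(i)}\Big) + \Big(\sum_{i=1}^{n-1} \Big(\tilde\beta_i \sum_{j=i+1}^{n} \beta_j\Big) Y_i\Big)
 =: \Big(\sum_{i=1}^{n} \beta_i R_{12}^{(i)}\Big) + \Big(\sum_{i=1}^{n-1} \gamma_i Y_i\Big),
\end{aligned}$$

where $\beta_i = \alpha_i \prod_{j=i+1}^{n} (1 - \alpha_j)$. Using the Bienaymé equality (e.g. (Bauer and Burckel, 1995))
the variance of the estimator takes the following form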

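$$V[V_1^{(n)}] \overset{\text{ind}}{=} V\Big[\sum_{i=1}^{n} \beta_i R_{12}^{(i)}\Big] + V\Big[\sum_{i=1}^{n-1} \gamma_i Y_i\Big] = V[R_{12}^{(1)}] \sum_{i=1}^{n} \beta_i^2 + V[Y_1] \sum_{i=1}^{n-1} \gamma_i^2,$$

where "ind" abbreviates "independence". $Y_1$ and $R_{12}^{(1)}$ have the same variance. With $\gamma_n = 0$,

$$V[V_1^{(n)}] = V[Y_1] \Big(\sum_{i=1}^{n} \beta_i^2 + \sum_{i=1}^{n} \gamma_i^2\Big).$$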

Because $0 \le \beta_i, \gamma_i \le 1$ and $\sum_{i=1}^{n} \beta_i = \sum_{i=1}^{n} \gamma_i = 1$ (see Appendix B) this term would
be minimal if and only if $\beta_i = \gamma_i = 1/n$. From $\beta_i = \tilde\beta_i = 1/n$, however, it follows that
$\gamma_i = \frac{1}{n} \sum_{j=i+1}^{n} \frac{1}{n} = (n - i)/n^2 \ne 1/n$. Hence optimality cannot be achieved. Since both
MC and TD are unbiased, we obtain MSE[MC] < MSE[TD].

This example demonstrates a major weakness of TD, namely that it is impossible for TD
to weight the observed paths equally, even for simple MRPs. Furthermore, MC is for this
example the optimal unbiased value estimator and TD($\lambda$) is unbiased. The optimality of MC
is a direct implication of Corollary 7 from Section 3.5. Therefore MSE[MC] $\le$ MSE[TD($\lambda$)]
for each $\lambda$.
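
The role of the weights $\beta_i = \alpha_i \prod_{j=i+1}^{n} (1 - \alpha_j)$ can be illustrated with a short sketch. It covers only the simplest case, a state whose TD(0) target is the observed sample itself (as for state 2 above), with the estimator initialized to 0. The function names `td_weights` and `td_run` are illustrative and not taken from the paper.

```python
from fractions import Fraction


def td_weights(alphas):
    """Weights beta_i = alpha_i * prod_{j > i} (1 - alpha_j) of the samples."""
    n = len(alphas)
    betas = []
    for i in range(n):
        w = alphas[i]
        for j in range(i + 1, n):
            w *= 1 - alphas[j]
        betas.append(w)
    return betas


def td_run(alphas, samples):
    """Run the update v <- v + alpha_i * (y_i - v), starting from v = 0."""
    v = Fraction(0)
    for a, y in zip(alphas, samples):
        v += a * (y - v)
    return v


n = 5
alphas = [Fraction(1, i) for i in range(1, n + 1)]  # alpha_i = 1/i
betas = td_weights(alphas)
samples = [Fraction(x) for x in (1, -1, 1, 1, -1)]

print(betas)                    # every weight equals 1/n
print(td_run(alphas, samples))  # equals the sample mean

# A constant learning rate (after alpha_1 = 1) weights recent samples more,
# so the sum of squared weights, and with it the variance, is larger.
constant = [Fraction(1)] + [Fraction(1, 2)] * (n - 1)
print(sum(w * w for w in td_weights(constant)), Fraction(1, n))
```

For independent samples with equal variance, the variance of a weighted average is proportional to the sum of squared weights, which is why uniform weights are the benchmark in the argument above.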
D.1.2 TD Superior to MC

Figure 8 (B) shows an example where TD(0) is superior. Let the number of observed paths be
$n = 2$ and $\gamma = 1$. The value of all states is zero. TD(0) and MC are unbiased for this example.
The variance of the MC estimator for states 1, 2 and 3 is therefore given by

$$\begin{aligned}
E[V_3^2] &= P[R^{(1)} = 1, R^{(2)} = 1] \cdot 1^2 + P[R^{(1)} = -1, R^{(2)} = -1] \cdot (-1)^2 \\
&\quad + P[R^{(1)} = 1, R^{(2)} = -1] \cdot 0 + P[R^{(1)} = -1, R^{(2)} = 1] \cdot 0 \\
&= (1/4) \cdot 1^2 + (1/4) \cdot (-1)^2 + (2/4) \cdot 0 = 1/2, \\
E[V_1^2] &= E[V_2^2] = (1/4) \cdot (1/2) + (1/2) \cdot 1 + 0 = 5/8,
\end{aligned}$$

where $R^{(i)}$ denotes the received reward in run $i$. The first term in the expression for $E[V_1^2]$
results from starting two times in state 1 (for $E[V_2^2]$, in state 2) and the second term from a
single start in that state. Setting the learning rate $\alpha_i$ to $\alpha_i = 1/i$ for TD, the estimator for state 3 is equivalent to
the corresponding MC estimator and therefore the variance is 1/2. In the first run the standard
TD(0) update rule uses the initialization value of state 3 to calculate the estimate in state 1
or 2. This is advantageous and results in a variance of 1/2. Without exploiting this advantage
the variance is 17/32. This is still lower than the variance of the MC estimator. Since both
estimators are unbiased we obtain MSE[TD(0)] < MSE[MC].
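
The second moments above can be reproduced by exhaustive enumeration. The sketch below rests on several assumptions that are read off the calculation rather than stated explicitly: states 1 and 2 are start states with probability 1/2 each, both lead to state 3 without reward, and state 3 yields reward +1 or -1 with probability 1/2 each. The TD variant in the code is one possible reading of "without exploiting this advantage": state 3 is updated first in every run and the start state then bootstraps from that fresh estimate, with learning rate 1/k on the k-th visit of a state. It is an illustration, not the paper's definition.

```python
from fractions import Fraction
from itertools import product


def second_moments():
    """Exact E[V^2] for n = 2 runs and gamma = 1 under the assumptions above.

    Returns (MC state 1, MC state 3, TD variant state 1).
    """
    mc1 = mc3 = td1 = Fraction(0)
    prob = Fraction(1, 16)  # 2 start states x 2 rewards, for each of 2 runs
    for starts in product((1, 2), repeat=2):
        for rewards in product((1, -1), repeat=2):
            # Monte Carlo: average of the returns observed from each state;
            # an unvisited state keeps its initial estimate 0.
            from_1 = [r for s, r in zip(starts, rewards) if s == 1]
            v1_mc = Fraction(sum(from_1), len(from_1)) if from_1 else Fraction(0)
            v3_mc = Fraction(sum(rewards), 2)

            # TD variant (assumption): update state 3 first, then let the
            # start state bootstrap from the updated estimate of state 3.
            v1 = v3 = Fraction(0)
            visits_1 = visits_3 = 0
            for s, r in zip(starts, rewards):
                visits_3 += 1
                v3 += Fraction(1, visits_3) * (r - v3)
                if s == 1:
                    visits_1 += 1
                    v1 += Fraction(1, visits_1) * (v3 - v1)

            mc1 += prob * v1_mc ** 2
            mc3 += prob * v3_mc ** 2
            td1 += prob * v1 ** 2
    return mc1, mc3, td1


print(second_moments())
```

Under these assumptions the enumeration returns 5/8 and 1/2 for the MC estimators of states 1 and 3, and 17/32 for the TD variant in state 1, matching the values quoted in the text.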



D.2 MVU/MC - ML 

We show by means of counterexamples that neither the MVU is superior to the ML estimator
nor is the ML estimator superior to the MVU or to the MC estimator. We use again the
MRP from Figure 2 (A) on page 11 with $n = 1$. As we showed before, the value for state 1 is
$(1 - p)/(1 - \gamma p)$ and the ML estimate is $(1 - \hat p)/(1 - \gamma \hat p)$, where $\hat p = i/(i + 1)$ and $i$ denotes
the number of times the cyclic connection has been taken. The MC estimate and therefore
the MVU estimate is given by $\gamma^i$. Because of the unbiasedness of the MVU/MC estimator the
MSE is given by:

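$$\mathrm{MSE}[\hat V_1] = E[\hat V_1^2] - V_1^2 = (1 - p) \sum_{i=0}^{\infty} \gamma^{2i} p^i - \frac{(1 - p)^2}{(1 - \gamma p)^2} = \frac{p (1 - p)(1 - \gamma)^2}{(1 - \gamma p)^2 (1 - \gamma^2 p)},$$

where $\hat V_1$ denotes the MVU/MC estimator. For the MSE of the ML estimator $\tilde V_1$ we need to
calculate the first and the second moment. The first moment:

$$E[\tilde V_1] = (1 - p) \sum_{i=0}^{\infty} \frac{1 - \hat p}{1 - \gamma \hat p} \, p^i = (1 - p) \sum_{i=0}^{\infty} \frac{p^i}{1 + (1 - \gamma) i}.$$

In the following, we choose $\gamma$ such that $(1 - \gamma)^{-1} = m \in \mathbb{N}$. The sum can then be written as

$$m (1 - p) \sum_{i=0}^{\infty} \frac{p^i}{m + i} = \frac{m (1 - p)}{p^m} \sum_{i=0}^{\infty} \frac{p^{m+i}}{m + i} = \frac{m (1 - p)}{p^m} \Big(\sum_{i=1}^{\infty} \frac{p^i}{i} - \sum_{i=1}^{m-1} \frac{p^i}{i}\Big) = \frac{m (1 - p)}{p^m} \Big(\ln \frac{1}{1 - p} - \sum_{i=1}^{m-1} \frac{p^i}{i}\Big).$$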

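The second moment:

$$E[\tilde V_1^2] = (1 - p) \sum_{i=0}^{\infty} \frac{(1 - \hat p)^2}{(1 - \gamma \hat p)^2} \, p^i = (1 - p) \sum_{i=0}^{\infty} \frac{p^i}{(1 + (1 - \gamma) i)^2} = \frac{(1 - p) m^2}{p^m} \sum_{i=0}^{\infty} \frac{p^{m+i}}{(m + i)^2} = \frac{(1 - p) m^2}{p^m} \Big(\sum_{i=1}^{\infty} \frac{p^i}{i^2} - \sum_{i=1}^{m-1} \frac{p^i}{i^2}\Big).$$

The infinite sum is called Spence function or dilogarithm and is denoted with $\mathrm{Li}_2(p)$. Using
these terms one can derive the MSE:

$$\begin{aligned}
\mathrm{MSE}[\tilde V_1] &= E[\tilde V_1^2] - 2 V_1 E[\tilde V_1] + V_1^2 \\
&= \frac{(1 - p) m^2}{p^m} \Big(\mathrm{Li}_2(p) - \sum_{i=1}^{m-1} \frac{p^i}{i^2}\Big)
 - \frac{2 (1 - p)^2 m^2}{(m (1 - p) + p) \, p^m} \Big(\ln \frac{1}{1 - p} - \sum_{i=1}^{m-1} \frac{p^i}{i}\Big)
 + \frac{(1 - p)^2 m^2}{(m (1 - p) + p)^2},
\end{aligned}$$

where $V_1 = (1 - p)/(1 - \gamma p) = m (1 - p)/(m (1 - p) + p)$.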

For $\gamma = p = 1/2$ the MSE of the MVU/MC estimator is 0.127 and that of the ML estimator is
0.072. In contrast, for $p = 0.99$ and $\gamma = 1/2$ the MSE of the MVU/MC estimator is 0.0129 and
that of the ML estimator is 0.0219.
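
The quoted numbers can be checked numerically. The sketch below evaluates the closed form for the MVU/MC estimator and sums the series for the first and second moment of the ML estimator directly, so it does not depend on the dilogarithm. The function names and the truncation length `terms` are choices of this sketch, and taking $\gamma = 1/2$ in the second case is an assumption that reproduces the reported values.

```python
def mse_mvu(p, gamma):
    """Closed-form MSE of the MVU/MC estimator gamma**i for n = 1."""
    return p * (1 - p) * (1 - gamma) ** 2 / (
        (1 - gamma * p) ** 2 * (1 - gamma ** 2 * p)
    )


def mse_ml(p, gamma, terms=20000):
    """MSE of the ML estimator via truncated series over i cyclic transitions."""
    value = (1 - p) / (1 - gamma * p)  # true value of state 1
    m1 = m2 = 0.0
    for i in range(terms):
        p_hat = i / (i + 1)            # ML estimate of p after i cycles
        estimate = (1 - p_hat) / (1 - gamma * p_hat)
        weight = (1 - p) * p ** i      # probability of exactly i cycles
        m1 += weight * estimate
        m2 += weight * estimate ** 2
    return m2 - 2 * value * m1 + value ** 2


for p in (0.5, 0.99):
    print(p, round(mse_mvu(p, 0.5), 4), round(mse_ml(p, 0.5), 4))
```

The ordering of the two estimators flips between the two settings, which is the point of the counterexample: neither estimator dominates the other in MSE.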